CLARIN Metadata & ISO DCR Daan Broeder. Max-Planck Institute for Psycholinguistics TKE ES05 Workshop, August 14th Dublin.



CLARIN Project
- The CLARIN project is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable for Language & SSH (Social Sciences & Humanities) researchers.
- It consists of the CLARIN EU project and different national CLARIN projects.
- Since 2007, CLARIN EU WP2 has investigated and created (prototypical) solutions for:
  - a common AAI infrastructure
  - a single system of persistent identifiers (PIDs) for resources
  - a common metadata domain
  - …

Current Metadata Situation
- Fragmented landscape
- Metadata sets, schemas & infrastructures in our domain: IMDI, OLAC/DCMI, TEI
- Problems with current solutions:
  - Inflexible: too many (IMDI) or too few (OLAC) metadata elements
  - Limited interoperability (both semantic and functional)
  - Problematic (unfamiliar) terminology for some sub-communities
  - Limited support for descriptions of LT tools & services

Metadata Components
- CLARIN chose a component approach: CMDI
- NOT a single new metadata schema, but rather the coexistence of many (community/researcher) defined schemas with explicit semantics for interoperability
How does this work?
- Components are bundles of related metadata elements that describe an aspect of the resource
- A complete description of a resource may require several components
- Components may contain other components
- Components should be designed for reusability
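The component idea can be sketched in a few lines of Python. This is only an illustration of nesting and reuse, not the CMDI data model itself: the class names `Element` and `Component`, the `paths` helper and the profile name `SpeechRecording` are assumptions made for this sketch. The component and element names follow the speech-recording example used in this presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A single metadata element (hypothetical class for this sketch)."""
    name: str


@dataclass
class Component:
    """A bundle of related elements that may contain other components."""
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Return dotted paths of all elements, including nested ones."""
        base = f"{prefix}{self.name}"
        found = [f"{base}.{e.name}" for e in self.elements]
        for child in self.components:
            found.extend(child.paths(prefix=f"{base}."))
        return found


# Reusable components
location = Component("Location", [Element("Continent"), Element("Country"), Element("Address")])
actor = Component("Actor", [Element("Name"), Element("Age"), Element("Sex")], [location])
technical = Component("TechnicalMetadata",
                      [Element("SampleFrequency"), Element("Format"), Element("Size")])

# A profile is a selection of components; Location is reused twice:
# once inside Actor and once at the top level.
profile = Component("SpeechRecording", components=[technical, actor, location])

for path in profile.paths():
    print(path)
```

The same `location` object appears both inside `Actor` and directly in the profile, which is the kind of reuse the component approach aims for.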

Metadata Components: let's describe a speech recording
[Diagram, built up over several slides] Components and example elements:
- Technical Metadata: Sample frequency, Format, Size, …
- Language: Name, Id, …
- Actor: Sex, Language, Age, Name, …
- Location: Continent, Country, Address, …
- Project: Name, Contact, …
Diagram labels: components (component definition XML), metadata profile (profile definition XML), metadata schema (W3C XML Schema), metadata description (XML file).

CMDI Component Reuse: selecting metadata components from the registry
[Diagram] The component registry holds components such as Location (Country, Coordinates), Actor (BirthDate, MotherTongue), Text (Language, Title), Recording (CreationDate, Type) and Dance (Name, Type).
The user selects appropriate components to create a new metadata profile, or selects an existing profile.

CMDI Explicit Semantics: selecting metadata components from the registry
[Diagram] The component registry holds components such as Location (Country, Coordinates), Actor (BirthDate, MotherTongue), Text (Language, Title), Recording (CreationDate, Type) and Dance (Name, Type). Elements carry concept links:
- ISOcat concept registry: BirthDate -> dcr:1000, Country -> dcr:1001, Language -> dcr:1002
- DCMI concept registry: Title -> dc:title
The user selects appropriate components to create a new metadata profile, or selects an existing profile.
Semantic interoperability is partly solved via references to the ISO DCR or another registry.
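A small sketch of why explicit concept links help: two profiles can use different element names and still be found together if both point at the same concept. The concept identifiers below are the ones shown on the slide; the `Speaker.DateOfBirth` element and the function name `elements_for_concept` are hypothetical and only serve the illustration.

```python
from typing import Dict, List

# Element path -> concept reference. The first four links use the
# identifiers shown on the slide; the "Speaker" entry is hypothetical.
CONCEPT_LINKS: Dict[str, str] = {
    "Actor.BirthDate": "dcr:1000",
    "Location.Country": "dcr:1001",
    "Text.Language": "dcr:1002",
    "Text.Title": "dc:title",
    "Speaker.DateOfBirth": "dcr:1000",  # hypothetical element from another profile
}


def elements_for_concept(links: Dict[str, str], concept: str) -> List[str]:
    """Return all element paths that reference the given concept."""
    return sorted(path for path, ref in links.items() if ref == concept)


# Searching by concept finds both differently named elements.
print(elements_for_concept(CONCEPT_LINKS, "dcr:1000"))
```

Searching on the concept rather than on the element name is what lets differently designed profiles coexist.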

Relation Registry
[Diagram] Component registry: Recording (CreationDate, Type), Dance (Name, Type), Text 1 (Language, Title, Genre1), Text 2 (Language, Title, Genre2). Concept links into ISOcat: Genre1 -> dcr:1020, Genre2 -> dcr:1030, Language -> dcr:1002.
Example relations in the Relation Registry (RR):
- dcr:1020 = dcr:1030
- dcr:1020 ~ dcr:1030
- dcr:1020 > dcr:1030
For an MD search, the user selects or creates a profile that specifies relations between DCs.
Metadata modelers or terminology experts can also use the RR to specify relations that the ISO DCR can't store.
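How a search service might use such relations can be sketched as a query expansion step. This is a minimal sketch under stated assumptions: it reads "=" as "equivalent" and "~" as "similar" and treats both as symmetric; the slide does not define the symbols, so that reading is an assumption. The identifier `dcr:9999` and the function name `expand` are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Relation = Tuple[str, str, str]  # (left concept, relation symbol, right concept)

RELATIONS: List[Relation] = [
    ("dcr:1020", "=", "dcr:1030"),   # pair taken from the slide
    ("dcr:1030", "~", "dcr:9999"),   # hypothetical extra relation
]

# Assumption: only these relation types are treated as symmetric.
SYMMETRIC = {"=", "~"}


def expand(concept: str, relations: Iterable[Relation],
           allowed: Iterable[str] = ("=",)) -> Set[str]:
    """Return the concept plus all concepts reachable via allowed symmetric relations."""
    usable = [r for r in relations if r[1] in allowed and r[1] in SYMMETRIC]
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for left, _, right in usable:
            for a, b in ((left, right), (right, left)):
                if a == current and b not in seen:
                    seen.add(b)
                    queue.append(b)
    return seen


# A strict search follows only "="; a looser search also follows "~".
print(sorted(expand("dcr:1020", RELATIONS)))
print(sorted(expand("dcr:1020", RELATIONS, allowed=("=", "~"))))
```

Letting the caller choose which relation types to follow mirrors the idea that the user selects a relation profile for a search.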

CMDI Metadata Life-cycle
[Diagram] Parts: CLARIN Component Registry/Editor; ISOcat, DCMI and other concept registries; Relation Registry; Metadata Repositories; Joint Metadata Repository; Semantic Mapping; Search Service.
- Create a metadata schema from a selection of existing components. New components may be created if they have references to ISOcat.
- A metadata component profile is selected from the metadata component registry.
- Metadata descriptions are created.
- Metadata is harvested via the OAI-PMH protocol.
- Search/browsing is performed on the metadata catalog using the ISO DCR, other concept registries and the CLARIN relation registry.
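The harvesting step can be illustrated with the request side of OAI-PMH. The sketch below only builds `ListRecords` request URLs and reads a `resumptionToken` from a response; it makes no network calls. The base URL and the sample response are made up for the example, and `oai_dc` is used as the metadata prefix because every OAI-PMH repository must support it; which prefix a CLARIN repository would expose for component metadata is not stated in this presentation.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument, so it replaces
    metadataPrefix on follow-up requests.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token of a response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hypothetical repository endpoint and a made-up, heavily shortened response.
BASE_URL = "https://repository.example.org/oai"
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>page2</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)

first = list_records_url(BASE_URL)
follow_up = list_records_url(BASE_URL, resumption_token=next_token(SAMPLE_RESPONSE))
print(first)
print(follow_up)
```

A harvester would repeat the follow-up request until `next_token` returns `None`.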

CMDI: Browsing the Component Registry

CMDI: Editing a Component

MD Components & Semantic Granularity
Problem with component metadata: it asks for too fine a granularity in ISOcat.
- Actor.Name, Actor.Fullname, Actor.Address, Actor.…
- Creator.Name, …, Creator.…
- Funder.Name, …, Funder.…
Having a DC for every one of these MD elements would make ISOcat explode. Using just a generic Name loses precision.
1. Compromise: use fine granularity only for elements that are expected to be used often for searching in metadata (CreatorName, ActorName). Map the rest to the generic Name.
2. More fundamental solution: use container concepts. Create an Actor DC; then we can reason with the context: Actor ~ Participant, Name ~ Fullname -> Actor.Fullname ~
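The container-concept idea can be sketched as matching element paths segment by segment. The two relations Actor ~ Participant and Name ~ Fullname come from the slide; the right-hand side of the slide's final relation is cut off in the source, so the pair compared below (`Actor.Fullname` against `Participant.Name`) is an illustrative choice, and the names `SIMILAR`, `canonical` and `paths_match` are hypothetical.

```python
from typing import Dict, Tuple

# Map each concept name to a chosen representative of its similarity class.
# Only the two relations from the slide are encoded.
SIMILAR: Dict[str, str] = {
    "Participant": "Actor",
    "Fullname": "Name",
}


def canonical(path: str) -> Tuple[str, ...]:
    """Replace every path segment by the representative of its class."""
    return tuple(SIMILAR.get(segment, segment) for segment in path.split("."))


def paths_match(a: str, b: str) -> bool:
    """Two element paths match if container and element match pairwise."""
    return canonical(a) == canonical(b)


print(paths_match("Actor.Fullname", "Participant.Name"))  # containers and elements are similar
print(paths_match("Actor.Fullname", "Funder.Name"))       # different container, no match
```

With this approach only the container concepts (Actor, Creator, Funder) and the generic element concepts need their own DCs, instead of one DC per combination.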

Metadata Thematic Domain
- DCs to describe Language Resources & Technology
- Chair: Peter Wittenburg, MPI for Psycholinguistics
- Started entering data in 2009, based on two expert meetings in Athens and guided by:
  - existing metadata sets: IMDI, OLAC/DC
  - an inventory: ENABLER
- Result: 218 DCs
- Translation work was initiated via the CLARIN national coordinators (15 language sections for audio file format)
- The Dutch CLARIN metadata project (2010) added 76 new DCs (of which 30 are still private)

Some experiences I
- The GUI is not very fast.
- A discussion platform is needed to discuss a DC's attributes (now solved with the forum function).
- UI arrangement: for instance, the value domain attributes (type, data type, value domain) are not in one panel.
- Metadata terms often needed to be linked to DCs that are either too broad or too narrow (the situation did not merit a new DC).
- Searching for existing DCs is only effective if you know the terminology.

Some experiences II
- Duplicate entries (e.g. source): the entry was made before the check was in place.
- Illogical or unsystematic definitions:
  - DC-2512: The name of the person who was participating in the creation project.
  - DC-2454: The name of the person that can be contacted to get access to the resource or to the tool/service.
  - DC-2505: The address of an organization that was/is involved in creating, managing and accessing resource or tool/service.
  - DC-2521: The address of a person or an organization that is involved in creating, managing or accessing resources or tools/services.
  - DC-2459: The organization that was leading the creation project or that is responsible for accessing the resource and the contact person is affiliated with.
  - DC-2461: The telephone number of a person or an organization that is involved in creating, managing or accessing the resource.

Next Steps for Metadata TD
- Expected standardization: October 2010
- Before that we will re-examine all DCs
- Build jury -> vote; DCR board -> vote
What happens then?
- The DCs will get new PIDs
- But we have metadata records in which the PIDs of the old DCs are used. Options:
  - Curate: update the metadata records
  - Redirect (if the owner agrees) to the standardized version
  - Make use of the Relation Registry: old_DC == new_DC ?
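The "curate" and "redirect" options can both be pictured as applying a table that maps old DC identifiers to their standardized successors. The sketch below is only an illustration: the identifiers `dcr:old-1`, `dcr:old-2`, `dcr:new-1` and the names `resolve` and `curate` are hypothetical, and nothing here reflects how ISOcat actually handles redirects.

```python
from typing import Dict

# Hypothetical redirect table: old DC identifier -> standardized DC identifier.
REDIRECTS: Dict[str, str] = {
    "dcr:old-1": "dcr:new-1",
    "dcr:old-2": "dcr:old-1",  # chains are followed until a final identifier is reached
}


def resolve(pid: str, redirects: Dict[str, str]) -> str:
    """Follow redirects to the final identifier, guarding against cycles."""
    seen = set()
    while pid in redirects and pid not in seen:
        seen.add(pid)
        pid = redirects[pid]
    return pid


def curate(record: Dict[str, str], redirects: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of a record (element -> concept link) with updated links."""
    return {element: resolve(pid, redirects) for element, pid in record.items()}


old_record = {"Genre": "dcr:old-2", "Language": "dcr:1002"}
print(curate(old_record, REDIRECTS))
```

Curating rewrites the stored records once, whereas redirecting or using the Relation Registry would leave the records untouched and apply the same lookup at search time.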

Thank you for your attention CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n°

WS05: Standardizing Data Categories in ISOcat: Implementing Group Work for Thematic Domains

CMDI Architecture I
Division into:
- MD producer components
- MD exploitation or consumer components
- OAI-PMH components
- Knowledge components: DCR, Relation Registry
The CMDI takes an archivist or production-first viewpoint:
- The priority is that the metadata is of good quality: consistent, coherent, correctly linked to the concept registries
- The consumer side can be more experimental and diverse. Many MD exploitation stacks or consumers can work in parallel on the same metadata

Concept registries
Basically a list of concepts and their descriptions, where every concept has a unique identifier.
- Some have a complicated structure and are associated with elaborate (administrative) processes to determine the status and acceptance of concepts in the registry, e.g. ISO-DCR.
- Others are static, simple lists of concepts and descriptions, e.g. DCTERMS.

ISO DCR
The ISO-DCR:
- is important for more CLARIN objectives than metadata, and is under the control of the linguistic community (ISO TC37)
- is an implementation of the model defined in ISO 12620, offering a GUI and programming APIs
- subjects every DC to a standardization process, and every DC carries information on the status of that process
- has 13 Thematic Domains, of which Metadata is just one
- cannot contain relations between DCs; only a value domain relation is possible

CMDI Architecture II
[Diagram] Components: MD Component Editor, MD Component Registry, ISOcat DCR, MD Editor, Local MD Repository, OAI-PMH Data Provider, OAI-PMH Service Provider, CLARIN Joint MD Repository, MD Services, Semantic Mapping Services, Relation Registry, MD Catalog, Virtual Collection Registry.
Actors: user, metadata modeler, ISO TDG, MD creator, external agents.

Current CMDI status I
- ISO-DCR: 218 metadata concepts
- CMDI component registry: 135 components, 19 profiles
- Produced & inspired by:
  - Deconstructing existing metadata schemas: IMDI, OLAC, TEI
  - Considering requirements of other CLARIN activities, like profile matching
- The CLARIN NL metadata project tested the CMDI model and delivered components and profiles for the resources in two major Dutch Language Resource centers

Current CMDI status II
- Operational or in test phase: ISOcat DCR, Component registry & editor, ARBIL metadata editor
- Still working on: Joint Metadata Repository, Metadata Catalog, Semantic Mapping, Relation Registry
- Expect a usable first version in the third quarter of 2010

CMDI contributors
Collaboration on the CMDI implementation:
- MPI for Psycholinguistics: metadata modeling and editing facilities
- Språkbanken, University of Gothenburg: Joint CLARIN metadata repository
- Austrian Academy: Metadata catalog, metadata & semantic mapping services
- IDS: Virtual Collection Registry
- MPG / CLARIN NL: ISO-DCR
- DFKI: Relation Registry

Common metadata domain
Why a common metadata domain?
- Finding and sharing resources housed at all archives & repositories participating in CLARIN
- Specifying distributed, heterogeneous collections of LRs and processing these collections
- In general, a common metadata domain helps bring about a single domain of LRs

Relation Registry (backup slide)
[Diagram repeated] Component registry, ISOcat and Relation Registry with the example relations dcr:1020 = dcr:1030, dcr:1020 ~ dcr:1030, dcr:1020 > dcr:1030. Actors: MD search user, MD modeler. The user selects or creates a profile that specifies relations between DCs.