Flexible Syntax and Concept Registries as a basis for Metadata Daan Broeder TLA - MPI for Psycholinguistics & CLARIN Metadata in Context, APA/CLARIN Workshop,

Slides:

Advertisements

Similar presentations

Putting the Pieces Together Grace Agnew Slide User Description Rights Holder Authentication Rights Video Object Permission Administration.

Advertisements

IRCS Workshop on Open Language Archives IMDI & Endangered Languages Archives Heidi Johnson / AILLA.

The Seven Pillars of Open Language Archiving: A Vision Statement Gary Simons and Steven Bird Workshop on Web-based Language Documentation and Description.

The Open Language Archives Community: Building a worldwide library of digital language resources Gary Simons, SIL International LSA Tutorial on Archiving.

Building metadata components Dieter Van Uytvanck Max Planck Institute for Psycholinguistics CLARIN-NL Info Session Nijmegen

CLARIN Metadata & ISO DCR Daan Broeder. Max-Planck Institute for Psycholinguistics TKE ES05 Workshop, August 14th Dublin.

CLARIN AAI, Web Services Security Requirements

Interoperability aspects in the The Virtual Language Observatory Dieter Van Uytvanck Max Planck Institute for Psycholinguistics

ISOcat Data Category Registry Defining widely accepted linguistic concepts Menzo Windhouwer 1CLARIN-NL MD tutorial, September 2009.

Advanced Metadata Usage Daan Broeder TLA - MPI for Psycholinguistics / CLARIN Metadata in Context, APA/CLARIN Workshop, September 2010 Nijmegen.

Interoperability Aspects in Europeana Antoine Isaac Workshop on Research Metadata in Context 7./8. September 2010, Nijmegen.

From CLARIN Component Metadata to Linked Open Data

The JISC IE Metadata Schema Registry Pete Johnston UKOLN, University of Bath JISC Joint Programmes Meeting Brighton, 6-7 July 2004

JOINING UP GOVERNMENTS EUROPEAN COMMISSION ADMS-enabled exploration of GS1 Dox 20 February 2013.

The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Metadata Component Framework Possible Standardization Work.

The current state of Metadata - as far as we understand it - Peter Wittenburg The Language Archive - Max Planck Institute CLARIN Research Infrastructure.

UKOLUG - July Metadata for the Web RDF and the Dublin Core Andy Powell UKOLN, University of Bath UKOLN.

CLARIN Centers for a Sustainable Infrastructure Daan Broeder, MPI for Psycholinguistics Jan Odijk, Utrecht University.

Chinese-European Workshop on Digital Preservation, Beijing July 14 – Chinese-European Workshop on Digital Preservation Beijing (China), July.

CLARIN-NL First Call Jan Odijk CLARIN-NL Kick-off Meeting Utrecht, 27 May 2009.

CLARIN-NL Second Open Call Jan Odijk CLARIN-NL Call 2 Info-session Amsterdam, 26 Aug 2010.

Agenda CMDI Workshop 9.15 Welcome 9.30 Introduction to metadata and the CLARIN Metadata Infrastructure (CMDI) 10.15Coffee 10.30Use of ISOCat within CMDI.

CLARIN web services and workflow Marc Kemps-Snijders.

The role of metadata schema registries XML and Educational Metadata, SBU, London, 10 July 2001 Pete Johnston UKOLN, University of Bath Bath, BA2 7AY UKOLN.

The ISO-DCR 17 January /20111CMDI tutorial Marc Kemps-Snijders a, Menzo Windhouwer b, Sue Ellen Wright c a Meertens Institute, b MPI for.

Eureka! User friendly access to the MPI linguistic data archive Max Planck Institute for Psycholinguistics Alexander Koenig Jacquelijn Ringersma Claus.

The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Increasing the usage of endangered language archives in the.

ISOcat demo and providing RELcat input Menzo Windhouwer The Language Archive tla.mpi.nl Data Archiving and Networked Solutions

CLARINO WP2 National Registry and Long- Term Archiving Freddy Wetjen and Oddrun Pauline Ohren National Library of Norway Bergen, 12. September 2013.

The role of Parthenos for CLARIN ERIC Steven Krauwer CLARIN ERIC Executive Director 1.

Content of the Data Category Registry 10 May /20111CLARIN-NL ISOcat workshop.

Metadata & CMDI CLARIN Component Metadata Infrastructure Daan Broeder et al. Max-Planck Institute for Psycholinguistics CLARIN NL CMDI Metadata Tutorial.

The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Why should we invest in DWF? Peter Wittenburg CLARIN Research.

Spoken dialog for e-learning supported by domain ontologies Dario Bianchi, Monica Mordonini and Agostino Poggi Dipartimento di Ingegneria dell’Informazione.

CMDI Component Registry Patrick Duin Max Planck Institute for Psycholinguistics 2011.

CLARIN Infrastructure Vision (and some real needs) Daan Broeder CLARIN EU/NL Max-Planck Institute for Psycholinguistics.

CLARIN Metadata Infrastructure Component Metadata and intermediate solutions Daan Broeder Claus Zinn Dieter van Uytvanck - Max-Planck Institute for Psycholinguistics.

LEXUS: a web based lexicon tool Jacquelijn Ringersma Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands.

Wishes from Hum infrastructures Examples: DOBES and CLARIN Peter Wittenburg Max Planck Institute for Psycholinguistics.

Lifecycle Metadata for Digital Objects (INF 389K) September 18, 2006 The Big Metadata Picture, Web Access, and the W3C Context.

The JISC IE Metadata Schema Registry and IEEE LOM Application Profiles Pete Johnston UKOLN, University of Bath CETIS Metadata & Digital Repositories SIG,

Customizing the IMDI metadata schema for endangered languages Heidi Johnson (AILLA) Arienne Dwyer (DOBES)

CLARIN for Linguists Portal & Searching for Resources Jan Odijk LOT Summerschool Nijmegen,

Depositors’ usage of IMDI metadata Daan Broeder & Alex Klassmann MPI Institute for Psycholinguistics DELAMAN meeting London 2006.

11 CMDI/ISOcat And Semantic Operability Ineke Schuurman ISOcat content coördinator CLARIN-NL Menzo Windhouwer ISOcat system administrator Utrecht

The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands NP CMDI-1 Metadata Component Framework New Standardization.

CLARIN Issues Peter Wittenburg MPI for Psycholinguistics Nijmegen, NL.

Technology – Broad View Aspects that play a role when integrating archives leave the details of some core topics to the 2. day Bernhard Neumair:Base Technologies.

A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.

Recent Developments in CLARIN-NL Jan Odijk P11 LREC, Istanbul, May 23,

CMDI Software Components. MD Service Delivers services for the Catalog & Search GUI – Query – Populate UI Acts as a WS and exposes the query and “queryModel()*”

Search Interoperability, OAI, and Metadata Sarah Shreeves University of Illinois at Urbana-Champaign Basics and Beyond Grainger Engineering Library April.

Beyond ISOcat 20 June 2013CLARIN-NL ISOcat tutorial1.

1 CLARIN - NL What is going on? Jan Odijk Amsterdam 26 Aug 2010.

The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands TLA/MPI requirements for a Semantic Registry.

Agenda CMDI Tutorial 9.30 Welcome & Coffee Introduction to metadata and the CLARIN Metadata Infrastructure (CMDI) 10.30CMDI & ISO-DCR 10.50The CMDI.

CLARIN Concept Registry: the new semantic registry Ineke Schuurman, Menzo Windhouwer, Oddrun Ohren, Daniel Zeman

Registry of MEG-related schemas MEG BECTa, Coventry, 17 July 2001 Pete Johnston UKOLN, University of Bath Bath, BA2 7AY UKOLN is supported by:

Creating & Testing CLARIN Metadata Components A CLARIN-NL project Folkert de Vriend Meertens Institute, Amsterdam 18/05/2010.

ISO TC 37/CLARIN DISCUSSION UTRECHT, DECEMBER 9/ Thinning Down a Bloated Cat SUE ELLEN WRIGHT DECEMBER 2013.

A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.

TDS-Curator DANS MPI for Psycholinguistics Utrecht Institute of Linguistics OTS languagelink.let.uu.nl/tds/ 9/21/20101CLARIN-NL - Call 1 - ISOcat status.

Describing resources II: Dublin Core CERN-UNESCO School on Digital Libraries Rabat, Nov 22-26, 2010 Annette Holtkamp CERN.

AAI needs of the Distributed Computing Infrastructures - CLARIN Dieter Van Uytvanck Max Planck Institute for Psycholinguistics

IPDA Registry Definitions Project Dan Crichton Pedro Osuna Alain Sarkissian.

Developing our Metadata: Technical Considerations & Approach Ray Plante NIST 4/14/16 NMI Registry Workshop BIPM, Paris 1 …don’t worry ;-) or How we concentrate.

CLARIN Federated Identity Vision

Outline Pursue Interoperability: Digital Libraries

Session 2: Metadata and Catalogues

Metadata in Digital Preservation: Setting the Scene

Presentation transcript:

Flexible Syntax and Concept Registries as a basis for Metadata Daan Broeder TLA - MPI for Psycholinguistics & CLARIN Metadata in Context, APA/CLARIN Workshop, September 2010 Nijmegen

Metadata for Language Resources  Research at the MPI for Psycholinguistics is performed with Language Resources:  (classical) text corpora,  multi-modal/multi-media recordings  Lexica possibly with multi-media extensions  Everything that can help study language  10 years ago started using metadata to try control the mounting chaos caused by increasing numbers of resources  Reuse, management, metadata itself is also valuable resource  What to use?  DCMI not specific enough and alien terminology  TEI too complicated and then had no support for media

Metadata for Language Resources  Developed IMDI, a dedicated metadata set for multi- media/multi-modal Language Resources  Flexibility: special profiles for Sign-Lang., SL Acquisition,…  Specializations for descriptions at the data set level.  Crosswalks to and from DC/OLAC  Currently the IMDI catalog contains about metadata records for > resources  Also harvested from external IMDI metadata providers  Currently working on harvesting IMDI records for CMU’s Childes & Talkbank corpus resources.  However the EU CLARIN project that is aimed at creating an integrated infrastructure for Language Resources allowed us to reconsider the situation.

Current LR Metadata Situation When starting the CLARIN project we saw a fragmented landscape  Metadata sets, schema & infrastructures in our domain:  IMDI, OLAC/DCMI, TEI  Problems with current solutions:  Inflexible: too many (IMDI) or too few (OLAC) metadata elements  Limited interoperability (both semantic and functional)  Problematic (unfamiliar) terminology for some sub- communities.  Limited support for LT tool & services descriptions

Metadata Components CLARIN chose for a component approach: CMDI  NOT a single new metadata schema  but rather allow coexistence of many (community/researcher) defined schemas  with explicit semantics for interoperability How does this work?  Components are bundles of related metadata elements that describe an aspect of the resource  A complete description of a resource may require several components.  Components may contain other components  Components should be designed for reusability

Metadata Components Technical Metadata Sample frequency Format Size … Lets describe a speech recording

Metadata Components Language Technical Metadata Name Id … Lets describe a speech recording

Metadata Components Language Technical Metadata Actor Sex Language Age Name … Lets describe a speech recording

Metadata Components Language Technical Metadata Actor Location … Continent Country Address Lets describe a speech recording

Metadata Components Language Technical Metadata Actor Location Project … Name Contact Lets describe a speech recording

Metadata Components Language Technical Metadata Actor Location Project Metadata schema Metadata description Lets describe a speech recording Component definition XML W3C XML Schema XML File Profile definition XML Metadata profile

Country dcr:1001 Language dcr:1002 Location Country Coordinates Actor BirthDate MotherTongue Text Language Title Recording CreationDate Type Component registry BirthDate dcr:1000 ISOcat concept registry user Dance Name Type Semantic interoperability partly solved via references to ISO DCR or other registry Selecting metadata components & profiles from the registry Title: dc:title DCMI concept registry CMDI Explicit Semantics User selects appropriate components to create a new metadata profile or an existing profile ISOCat or ISO DCR implementation of ISO standard for data categories under control of the linguistic community ISO TC37 Metadata is just one of the seven “thematic domains”

Recording CreationDate Type Component registry Genre 1 dcr:1020 Language dcr:1002 Genre 2 dcr:1030 Dance Name Type Relation Registry Text 1 Language Title Genre1 Text 2 Language Title Genre2 ISOCat Relation Registry User MD search User selects or creates a profile that specifies relations between DCs dcr:1020 = dcr:1030 dcr:1020 ~ dcr:1030 dcr:1020 > dcr:1030 Metadata modelers or terminology expert can also use the RR to specify relations that the ISO DCR can’t store

CMDI Architecture MD Comp. Editor MD Comp. Registry ISO-Cat DCR MD Editor. Local MD Repository OAI-PMH Data provider OAI-PMH Service Provider CLARIN Joint MD Repository MD Services Semantic mapping Services Relation Registry MD Catalog user Metadata modeler ISO TDG MD Creator External agents

Metadata Components & Semantic Granularity  Problems with component metadata: too high granularity in the ISOCat  Actor.Name, Actor.Fullname, Actor.Address, Actor. ,…  Creator.Name, …, Creator. ,…  Funder.Name, …,Funder.  Having a DC for every of these MD elements would explode the ISOcat. Using just generic “Name” loses precision. 1.Compromise: use fine granularity only for elements that are expected to be often used (CreatorName, ActorName) for searching in metadata. Map the rest to generic “Name” 2.More fundamental solution: Use container concepts:  create an “Actor” DC, then we can reason with the context.  Actor ~ Participant, Name ~ Fullname -> Actor.Fullname ~ Participant.name

Query Scenario’s I  Keyword search with regexps  Searching for “Mandarin” will give you all resources in and about Mandarin  Semantic mapping is possible if a keyword is present in a concept registry.  Query for: “discourse” can return also records that have “dialog”  It looks “very” useful to have vocabularies available as pick lists in a taxonomy tree

Query Scenario’s -Terminology What terminology to use in the search interface?  The “canonical” one from the ISOcat  Might be unknown to the user  Some standard ones like OLAC, IMDI, TEI  To provide “compatibility” with these existing frameworks  The terminology used in the metadata components  A user should be able to use the same terms for retrieving resources as he used when creating the metadata  Then he should at least retrieve what he put in.

Metadata quality  In the end all will fail if we cannot provide high quality metadata.  See the VLO presentation tomorrow for an idea about the current status  Collaborative effort of researchers, funders, archive managers and infrastructure & tool builders  Researchers (still) need to be convinced that it is worthwhile  Funders need to allow them the time  Archive management need to audit, evaluate, curate  Tools and infrastructure need to make it all easy as possible

Thank you for your attention CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n°

Query Scenario’s II  Consider the following queries Participant.name = ‘xxx’ Actor.fullname = ‘xxx’ DCR: participant_name == Particpant.name DCR: participant_name == Actor.fullname  Context problem solved by high DCR granularity  If you extrapolate this strategy requires a lot of entries in the DCR.

Query Scenario’s III  Consider the following queries Participant.name = ‘xxx’ Actor.fullname = ‘xxx’  Limited granularity with registration of container concepts, perhaps low semantic precision DCR:name == fullname myDCR:Actor == Participant

Query Scenario’s IV  Consider the following queries Participant.name = ‘xxx’ Actor.fullname = ‘xxx’  Precise semantics, low granularity DCR:name == name DCR:fullname == fullname myDCR:Actor == Actor myDCR:Particpant == Participant RR: Participant isKindofA Actor RR: fullname isKindofA name

Query Scenario’s V  Consider the following queries Participant.function = ‘annotator’; Participant.name = ‘xxx’ Actor.role = ‘creator’; Actor.name = ‘xxx’ But now map also to: Annotation.Creator = ‘xxx’ ???

Content  IMDI history  CMDI basics, components, ISOCat  Context independency  Metadata for Aggregations  Virtual collection registry  Profile matching via metadata  Caveat: metadata quality – fighting entropy