CLARIN Requirements for a Semantic Registry Daan Broeder The Language Archive – MPI Ineke Schuurman CLARIN-NL/VL – KU Leuven & Utrecht.

Presentation transcript:

CLARIN Requirements for a Semantic Registry Daan Broeder The Language Archive – MPI Ineke Schuurman CLARIN-NL/VL – KU Leuven & Utrecht University Menzo Windhouwer The Language Archive - DANS ISOcat meeting 10 December 2013 Utrecht, The Netherlands

Outline
- CLARIN’s use of ISOcat
- CLARIN(-NL) experiences and requirements
- Data model
- Process
- User interface

CLARIN’s use of ISOcat - 1
- Component Metadata (CMD) uses ISOcat as a Concept Registry
- CMD profiles, components, elements, attributes and values can all link to ISOcat DCs in
- The ComponentRegistry editor lets users search ISOcat and find relevant DCs
  - It tries to steer the user to the right DC type for a CMD context, but users can also copy a PID directly into the ConceptLink
  - Types frequently mismatch
- The link between ISOcat and CMDI is weak, i.e., when a DC is selected, none of the specification information (data type, value domain) is taken over
  - Representation information is seen as a hint
- The ConceptLinks are used by the VLO (a faceted search tool) for ‘automatic’ mappings from CMD records to facets
  - This makes the flexibility in CMDI for the use of many different metadata structures and terminologies manageable for generic tools
- DCs on the component level become more and more important for disambiguation as they provide context

CMDI and ISOcat - 1
[Diagram of the CMDI infrastructure. Labels: OAI-PMH data provider, OAI-PMH service provider, local metadata repository, joint metadata repository, metadata modeler, metadata user, metadata creator, metadata curator, component registry & editor, metadata editor, metadata catalogue, Relation Registry, search & semantic mapping, data, ISOcat]

CMDI and ISOcat - 2
- Desired mapping of CMD types to DC types
  - CMD profile → container Data Category
  - CMD component → container Data Category
  - CMD element → complex Data Category
  - CMD attribute → complex Data Category
  - CMD value → simple Data Category
- Due to the flexible nature of CMD this can potentially lead to many semantically equivalent DCs, but with different types
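The desired mapping above can be written down as a small lookup table. The following is a minimal sketch, not part of ISOcat or the ComponentRegistry: the names `EXPECTED_DC_TYPE` and `check_concept_link` are hypothetical, and the sketch assumes the CMD construct kind and the type of the linked DC are already known to the caller.

```python
from typing import Optional

# Desired DC type per CMD construct, as listed on the slide.
EXPECTED_DC_TYPE = {
    "profile": "container",
    "component": "container",
    "element": "complex",
    "attribute": "complex",
    "value": "simple",
}


def check_concept_link(cmd_kind: str, dc_type: str) -> Optional[str]:
    """Return a warning when the linked DC type differs from the desired one, else None."""
    expected = EXPECTED_DC_TYPE[cmd_kind]
    if dc_type == expected:
        return None
    return f"CMD {cmd_kind} links to a {dc_type} DC, expected a {expected} DC"
```

A check like this can only warn. As the later slides argue, the registry itself has no means to enforce the mapping inside the resources that use the DCs.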

CLARIN use of ISOcat - 2
- Content can also be semantically annotated with DCs
  - CLARIN-NL call projects are required to do so
  - Supported by yearly workshops
  - The CLARIN-NL/VL group bundles this work
- Some CLARIN national initiatives have created DCs for tagsets
  - Netherlands: CGN
  - Poland: NKJP
  - Germany: STTS
  - Spain: not yet, but asked for info
- Some CLARIN-related groups have used/created DCs for ISO TC 37 standards:
  - Uby/Cornetto: LMF
  - SHEBANQ: LAF/GrAF
- These annotations could be exploited by (federated) search engines
  - Currently level 0 (full text search) is working (no DCR involvement)
  - Drafts for level 1 included ISOcat and RELcat interaction (at least for the metadata part)
  - Needs a search indexing engine on the center side that understands DC-annotated resources

CLARIN use of ISOcat - 3
- In the absence of standardized DCs, the CLARIN ERIC suggested appointing national ISOcat (and/or CMDI) coordinators to streamline the process of ISOcat usage and DC creation
  - They have been appointed, but they have not yet met
  - The CLARIN-NL/VL experience is important input for this coordination effort

Data model - 1
[Figure: ISO 12620:2009]

DC types - 1
[Diagram of DC types]
- complex, open: writtenForm (string)
- complex, closed: grammaticalGender (string; values: neuter, masculine, feminine)
- complex, constrained: (string)
- simple: neuter, masculine, feminine

DC types - 2
[Diagram of container DCs. Labels: language, alphabet, writtenForm, japanese, ipa, lexicon, entry, lemma]

Data model - 2
- Proliferation due to types:
  - Different uses of a concept in a data model might lead to different representations
    - in part-of-speech = “verb”, /verb/ is a simple data category
    - in verb = “to walk”, /verb/ is an open data category
    - in both cases the semantics might be the same
  - The DCR data model doesn’t have provisions to share a semantic core (a concept)
    - So users have to recreate data categories because they need another type
    - This leads to proliferation and makes it hard for users to select the right DC or to keep semantics in sync across types
    - Many users will just use the wrong type

Data model - 3
- Conflicts with actual use
  - (metadata) resources point to DCs from a specific context
    - A CMD component should point to a container DC, but this does not always happen and ISOcat has no way to enforce it
    - A RELAX NG element should point to a simple DC, but this can’t be enforced by ISOcat
    - …
  - The ISOcat DCR has no way to enforce proper use of DC types within the resources
    - It can steer in the form of XSD, RNG, FSD, … templates
    - But in most (all CLARIN) cases a schema already exists
  - In all cases the resource context provides the actual and thus correct representation information

Embed DCs in RNG
ipa …
- An XML attribute implies a complex DC
- A value implies a simple DC

Embed DCs in a FSR (LMF/LAF/TEI)
- A feature name implies a complex DC
- A feature value implies a simple DC

DC mismatches in CMDI
- 469 simple DCs
  - are linked to 165 CMD elements
  - are linked to 72 CMD components
- 631 complex DCs
  - are linked to 778 CMD components
- 59 container DCs
  - are linked to 4 CMD elements

Data model - 4
- A rare blend of expertise
  - To create a good quality DC specification one needs to be able to
    - provide a good definition, i.e., have domain expertise
    - pick the right DC type and data type, i.e., have technical insight
  - This is a rare combination, often spread over multiple persons with different roles in a project, and these might not come together to create a quality DC specification
  - And even technical users are inclined to ignore DC types and select DCs based on matching semantics
    - The CMDI metadata modelers are in most cases more technically oriented, yet they still use the wrong DC types
  - This leads to conflicts between the specified DC representation and its use in the actual resource context

Which DC type to use?
Which type is appropriate depends on the place of the data category in the structure of your resource:
1. Can it have a value? → Complex Data Category with a data type
   - Any of the values of the data type? → Open Data Category
   - Can you enumerate the values? → Closed Data Category
     - Fill its value domain with simple Data Categories
   - Is there a rule to constrain the values? → Constrained Data Category
     - Express the rule/constraint in one of the rule languages
2. Is it a value? → Simple Data Category
3. Does it group other (container or complex) Data Categories? → Container Data Category
- If a Data Category both has a value and groups Data Categories → Complex Data Category (or two different Data Categories: one container and one complex)
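The three questions above amount to a small decision procedure. The sketch below is one possible reading of the slide, not an ISOcat feature: the function name `choose_dc_type` and its boolean parameters are hypothetical, and the answers to the questions are assumed to come from whoever models the resource.

```python
def choose_dc_type(has_value: bool = False,
                   is_value: bool = False,
                   groups_others: bool = False,
                   enumerable: bool = False,
                   rule_constrained: bool = False) -> str:
    """Pick a DC type from the answers to the questions on the slide."""
    if has_value:
        # A DC that has a value is complex, even when it also groups other DCs.
        if enumerable:
            return "closed"       # value domain filled with simple DCs
        if rule_constrained:
            return "constrained"  # rule expressed in one of the rule languages
        return "open"             # any value of the data type is allowed
    if is_value:
        return "simple"
    if groups_others:
        return "container"
    raise ValueError("the DC neither has a value, is a value, nor groups other DCs")
```

For example, `choose_dc_type(has_value=True, enumerable=True)` yields `"closed"`, which matches /grammaticalGender/ with its enumerated values.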

Examples
[Feature structure: category = noun phrase; agreement: person = third, number = singular]
- /category/ a closed DC
- /noun phrase/ a simple DC
- /agreement/ a container DC
- /number/ a closed DC
- /singular/ a simple DC
- /person/ a closed DC
- /third/ a simple DC
(Encoded as TEI P5 FSR, the XML elements and attributes are seen as syntactic sugar, or are semantically annotated on a next (meta) level)

[Syntax tree. Nodes: S, NP, VP, V, NP, Det, N; Text=“John”, Text=“hit”, Text=“the”, Text=“ball”]
- /S/ a container DC
- /NP/ an open DC
- /VP/ a container DC
- /V/ an open DC
- /NP/ a container DC
- /Det/ an open DC
- /N/ an open DC
(Text is seen as syntactic sugar)

… N(soort,ev,basis,zijd,stan) …
XSD:
- /text/ a container DC or an open DC
- /tag/ a constrained DC (points to EBNF)
EBNF:
- /PoS/ is a closed DC
- /N/ is a simple DC
- /NTYPE/ is a closed DC
- /soort/ is a simple DC
- …
(Use EBNF to go into (P)CDATA that contains additional structure/semantics)
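The last example has structure hidden inside character data: a tag such as N(soort,ev,basis,zijd,stan) packs a part of speech and several feature values into one string. The sketch below splits such a tag so each part could be linked to its own DC. It is only an illustration under assumptions: the slide does not give the EBNF, so the pattern `TAG_PATTERN` and the function `parse_tag` are hypothetical and cover just the `HEAD(value,value,…)` shape of the example.

```python
import re
from typing import List, Tuple

# Assumed shape: a head symbol followed by comma-separated values in parentheses.
TAG_PATTERN = re.compile(r"^(?P<pos>[A-Za-z]+)\((?P<features>[^()]*)\)$")


def parse_tag(tag: str) -> Tuple[str, List[str]]:
    """Split a tag into its part of speech and its list of feature values."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"not a recognised tag: {tag!r}")
    features = [f.strip() for f in match.group("features").split(",") if f.strip()]
    return match.group("pos"), features
```

With the example from the slide, the head `N` would be the value linked to the simple DC /N/, and `soort` the value linked to the simple DC /soort/.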

Data model - 5
- These experiences:
  - Proliferation due to types
  - Conflicts with actual use
  - A rare blend of expertise is needed
- And the insight that the actual representation can be derived from the resource context where the DC is used, and has to be, as the DCR can’t enforce conformance
- Lead to the proposal to drop all types from the data model
- Are these then still DCs? Maybe not, and CLARIN actually needs a (lightweight) Concept Registry (as ISOcat was already often marketed within CLARIN)
  - the differences are too subtle for most users, so in many/most cases they will have entered ‘concept’ definitions
  - but for consistency we’ll keep on using the term DCs in this presentation

Data model - 6
- Other proposed changes:
  - Remove the standardization information: leave the process to the implementation (recommendations)
  - Remove Linguistic Sections: they are underused and currently hard to maintain (would require language specialists to have edit access)
  - Replace DENs by an Also Referred To As (ARTA) table: for mapping purposes we’re interested in more than names
  - Replace identifier by a required ARTA entry: /identifier/ is confusing as it isn’t unique, the DC PID is unique
    - So don’t remove it, just merge it with the DEN/ARTAs
  - Consistent use of /source/: currently DENs use /source/ differently than the Definition Section; maybe use /origin/ in the DEN/ARTA entry
  - Allow only one Definition Section: currently the official data model allows multiple Definition Sections for a Language Section; if this happens a DC is ambiguous, so only one should be allowed (ISOcat already checks for that)
  - Indicate successors of superseded names: currently names can get the status superseded, but it’s impossible to indicate by which name; this should become possible or names should just be deprecated

Data model - 7

Data model - 8
- In this lightweight model no relationships between DCs are stored anymore:
  - In line with the previous insight that ontological relationships are too application/domain specific
  - Representation-based information and relationships are actually just as application/domain specific
- Still, relationships are interesting information
  - Store them in application/domain specific relation sets in the Relation Registry
- Needed for disambiguation and full semantic description
  - which DC (from which theoretical framework) does this concept in the definition refer to?
  - which abstract concept (which would most likely never be a DC, and certainly has a hard to determine DC type) does this definition refer to?
  - Link to the right DC/concept from the definition
  - But leave typing to an application/domain specific relation set in the Relation Registry

Data model - 9
- Alternative data models (always a PID):
  - SKOS
    - By combination with the Relation Registry
  - Name (1) -> {description, lang} (N), {keyname,value} (N)
    - A term base?
    - Experience shows that a bit more is needed for guiding the user, e.g., examples, guidelines
  - …
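The second alternative, one name with any number of {description, lang} entries and {keyname, value} pairs, always identified by a PID, could be sketched as follows. This is a minimal illustration only: the class names, field names and the example PID are hypothetical and not taken from any existing registry.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Description:
    text: str
    lang: str


@dataclass
class Concept:
    pid: str    # always a PID
    name: str   # Name (1)
    descriptions: List[Description] = field(default_factory=list)   # {description, lang} (N)
    properties: List[Tuple[str, str]] = field(default_factory=list)  # {keyname, value} (N)

    def descriptions_in(self, lang: str) -> List[str]:
        """Return the description texts available in the given language."""
        return [d.text for d in self.descriptions if d.lang == lang]


# Hypothetical entry; the PID below is a placeholder, not a real handle.
noun = Concept(
    pid="hdl:EXAMPLE/noun",
    name="noun",
    descriptions=[Description("A part of speech that names an entity.", "en")],
    properties=[("example", "ball")],
)
```

The free key/value pairs are where the extra guidance mentioned on the slide (examples, guidelines) could go without adding further fixed fields.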

Process - 1
- Uptake of ISOcat is hampered by the strict ownership of data categories
  - All changes have to go through the owner, unless (s)he shared edit rights with you
- What could be made easier?
  - (Adding a value)
  - Adding a DEN/ARTA entry
  - Adding a translation
    - Only the English Language Section is owned
  - Adding a new profile membership
    - Profiles could be replaced by tagging
  - Publication of DCs by a coordinator
    - If the owner hands over a DC to a coordinator, (s)he implicitly indicates that (s)he finds the DC ready for publication and so the coordinator can do it on his/her behalf

Process - 2
- One can take openness to an extreme with a wiki approach
  - Stable semantics could still be there by giving every revision a PID
    - But that might be too fine a granularity
    - A versioning policy managed by the user
  - Semantic drift might be uncontrolled
    - As the user finds updating his (new) resources to use a new PID too cumbersome
    - Might be a task for a coordinator
- Transfer of ownership should be possible
  - Triggered by a coordinator if an owner becomes unresponsive

User interface - 1
- ISOcat uses the General Interface RIA framework, which is getting old and is not actively developed anymore
- Time for a more modern approach, e.g., Bootstrap/Angular
  - Wiki-like approach, i.e., more text oriented instead of the form oriented approach
- However, an existing Semantic Registry framework will most likely come with its own framework
  - Which might not (out-of-the-box) support functionality we do like in the ISOcat UI (if any ;-)

CLARIN requirements
- CLARIN needs a Semantic Registry
- It more likely needs a Concept Registry than a Data Category Registry
  - As actual representation information is provided by the resource context
- Users
  - only occasionally use the Semantic Registry, so it needs to be intuitive and not too complex (data model and UI)
  - do not have much technical expertise (in general), so providing correct representation information is a hard task for them
- Some (perceived) proliferation is unavoidable
  - due to different theoretical frameworks
  - so disambiguation of concepts mentioned in definitions needs to be done, i.e., concepts specific to theories can’t be mixed and matched
  - but should still be avoided where possible: “be as generic as possible and as specific as needed”

Interesting ideas
- ConceptWiki knowlets that group near same concepts
  - conceptwiki.org only supports authorities, community involvement is currently disabled
  - When to start a new knowlet, i.e., when has a concept drifted off too much?
- RDA DFT StackExchange experiment
  - Don’t provide a PID to the ‘knowlet’, but give each entry a PID and show them in the knowlet context (during search)
- EUDAT semantic services
  - API to search in authorities for a matching concept and an ‘overflow’ lightweight semantic store
  - When no match is found add an entry to the lightweight store
  - This lightweight store provides uncurated entries, which can be picked up by authorities (bottom up, grass roots)
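The lookup-with-overflow idea described for the EUDAT semantic services can be sketched in a few lines. This is not the EUDAT API: the function `find_or_register`, the dictionary layout and the sample data are all hypothetical, and matching is reduced to a case-insensitive label comparison for the sake of the illustration.

```python
from typing import Dict, Tuple


def find_or_register(label: str,
                     authorities: Dict[str, Dict[str, str]],
                     store: Dict[str, str],
                     definition: str = "") -> Tuple[str, str]:
    """Search the authorities first; fall back to the uncurated lightweight store.

    Returns (source, identifier), where source is an authority name or "lightweight".
    """
    key = label.strip().lower()
    for authority_name, concepts in authorities.items():
        if key in concepts:
            return authority_name, concepts[key]
    # No authority knows the concept: add an uncurated entry (bottom up).
    store.setdefault(key, definition)
    return "lightweight", key


# Hypothetical sample data; the identifier is a placeholder.
authorities = {"example-authority": {"noun": "EXAMPLE-PID-1"}}
store: Dict[str, str] = {}
```

Entries that accumulate in `store` are what an authority could later review and adopt.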

Possible new setups
- Lightweight ISOcat (ISO TC37 RA, TLA)
  - DC types and relations move to RELcat
    - /noun/ dcr:hasType /simple/
    - /noun/ dcr:hasConceptBase
    - /PoS/ dcr:hasType /closed/
    - /PoS/ dcr:hasConceptBase
    - /noun/ dcr:isPossibleValueOf /PoS/
    - where the dcr:* properties are from a DCR-specific RDF vocabulary
- Full scale ISOcat (ISO TC37 RA)
  - Contains the ISO TC37 standardized DCs
  - Less open DCR than now, i.e., just ISO TC37 experts
  - Carefully curated sets of DCs
- New lightweight semantic registry (CLARIN, EUDAT, TLA)
  - Open for everyone to register ‘new’ semantics
  - Uncurated concepts/terms/… (informal)
  - Authorities can see what bubbles up from the communities
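Once types and value relations live in RELcat as statements, a consumer can recover them with simple queries. The sketch below holds the three complete statements from the slide as plain tuples; it does not use a real RDF library or the actual RELcat interface, the helper names are hypothetical, and the dcr:hasConceptBase statements are left out because their objects are not given on the slide.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# The complete statements from the slide, as (subject, predicate, object).
TRIPLES: List[Triple] = [
    ("/noun/", "dcr:hasType", "/simple/"),
    ("/PoS/", "dcr:hasType", "/closed/"),
    ("/noun/", "dcr:isPossibleValueOf", "/PoS/"),
]


def objects(subject: str, predicate: str, triples: List[Triple] = TRIPLES) -> List[str]:
    """Return all objects stated for a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def possible_values(dc: str, triples: List[Triple] = TRIPLES) -> List[str]:
    """Return the DCs recorded as possible values of the given DC."""
    return [s for s, p, o in triples if p == "dcr:isPossibleValueOf" and o == dc]
```

Here `objects("/PoS/", "dcr:hasType")` returns `["/closed/"]` and `possible_values("/PoS/")` returns `["/noun/"]`, so the typing and the value domain that were dropped from the lightweight registry remain available from the relation set.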