From CLARIN Component Metadata to Linked Open Data


From CLARIN Component Metadata to Linked Open Data
Matej Durco, Institute for Corpus Linguistics and Text Technology, matej.durco@oeaw.ac.at
Menzo Windhouwer, The Language Archive - DANS, menzo.windhouwer@dans.knaw.nl
LDL@LREC 2014, Reykjavik, Iceland

Outline
- CLARIN
- Component Metadata Infrastructure (CMDI)
- CMD 2 RDF
  - Model
  - Profiles and components
  - Instances
- Some first experiments
- Conclusions and future work

CLARIN
- CLARIN = Common Language Resources and Technology Infrastructure, a European ESFRI infrastructure project
- Aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyze or combine them, independent of where they are located
- Building a networked federation of European data repositories, service centers and centers of expertise
- One pillar of this infrastructure is a joint metadata domain
http://www.clarin.eu/

Component Metadata Infrastructure
Rationale for CMDI: limitations of existing metadata schemas (OLAC/DCMI, IMDI, TEI header)
- Inflexible: too many (IMDI) or too few (OLAC) metadata elements
- Limited interoperability (both semantic and syntactic)
- Problematic (unfamiliar) terminology for some sub-communities
- Limited support for descriptions of LT tools & services
CMDI addresses this by:
- Explicitly defined schema & semantics
- User/project/community defined components
http://www.clarin.eu/cmdi/

CMDI - example
Let's describe a speech recording with a Metadata Profile built from components:
- Project: Name, Contact
- Location: Continent, Country, Address
- Actor: Sex (male, female), Language, Age, Name
- Language: Name, Id (aaa … zzj)
- Technical Metadata: Sample frequency, Format, Size

CMDI - example
Let's describe a speech recording:
Components (Project, Location, Actor, Language, Technical Metadata) → Metadata Profile → Metadata schema (W3C XML Schema) → Metadata description (XML document)

CMDI - workflow
[Diagram; labels: metadata modeler, ISOcat, metadata catalogue, component registry & editor, metadata user, metadata creator, Relation Registry, search & semantic mapping, metadata editor, metadata curator, Joint metadata repository, Local metadata repository, OAI-PMH Service provider, OAI-PMH Data provider, DATA]

CMDI in CLARIN

                                2011-01  2012-06  2013-01  2013-06  2014-03
Profiles                             40       53       87      124      153
Components                          164      298      542      828     1110
Elements                            511      893     1505     2399     3101
Distinct Data Categories (DCs)      203      266      436      499      737
Metadata DCs                        277      712      774      791     1103
% Elements w/o DCs                24.7%    17.6%    21.5%    26.5%    24.2%

- CMD profiles for existing metadata schemas like OLAC/DCMI, TEI Header and META-SHARE have been created
- Profiles differ a lot in structure:
  - Small and flat profiles with 5 – 10 elements
  - Large and complex profiles of up to 10 component levels with hundreds of elements
- More than 670,000 CMD records are harvested from around 60 providers
http://catalog.clarin.eu/vlo/

CMD Cloud
- By reusing data categories and components a semantic network is created: a CMD cloud with clusters of related resources (CMD cloud poster + demo, Wednesday, P10, 156)
- The CMD faceted browser (aka VLO) uses this semantic layer to find facet mappings and deal with the diversity of CMD records (CLARIN booth, HLT Village)
- CMDI is based on XML, a well-established core technology in the metadata domain
- Still with the focus on semantics, let's see how it could look in RDF

CMD 2 RDF
To map a CMD record to RDF we need:
- A mapping for the basic component model: basic classes and properties to represent profiles, components, elements, attributes and their relationships and values
- A mapping for a specific profile or component: a specific subclass or subproperty of the basic component model
- A mapping for specific metadata records: instances of a profile or component
- Embedding in common LOD vocabularies

Component Metadata Model
- The basic CMD model is described by ISO/DIS 24622-1, the 1st part of ISO TC 37 SC 4 3 CMD standards family
- Natural mapping to RDF:
  - Profiles/components to RDF Classes
  - Elements to RDF Properties
- Complication: CLARIN's CMDI allows attributes on both Components and Elements, so Elements also have to be RDF Classes

CMDM 2 RDF
[Diagram; classes: cmdm:Component, cmdm:Profile, cmdm:Element, cmdm:Attribute, cmdm:Entity, cmdm:Value; properties: cmdm:contains, cmdm:containsAttribute, cmdm:hasElementEntity, cmdm:hasElementValue, cmdm:hasAttributeEntity, cmdm:hasAttributeValue; relation: rdfs:subClassOf]

CR 2 RDF
- To foster reuse, profiles and components are stored in the Component Registry
- Its REST API provides them with a URI:
  http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1299509410079
- We reuse this URI + '/rdf' to identify profiles and components
- Future work: the ComponentRegistry will actually return the RDF representation

CR 2 RDF (cnt.)
- A profile or component can have inner components:
  Parameter
    Name
    Description
    Values
      ParameterValue
        Value
- To indicate a specific inner component or element, add the dot-path to the profile/root component URI:
  http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1299509410079/rdf#Parameter.Description
  http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1299509410079/rdf#Parameter.Values.ParameterValue.Description
- Semantic equivalence of components/elements/attributes/values can be indicated by sharing a ConceptLink (to an ISOcat data category) → dcr:datcat
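The URI scheme above can be sketched in a few lines of Python. This is only an illustration of the dot-path convention shown on the slide; the function name `component_uri` and its parameters are assumptions made for this example, not part of the Component Registry API.

```python
# Base URI of the Component Registry REST API, as shown on the slide.
REGISTRY = "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/"


def component_uri(component_id, path=(), registry=REGISTRY):
    """Build the RDF URI of a component, or of an inner component/element.

    `path` is the chain of names from the root component down to the
    inner component or element; it becomes the dot-path fragment.
    """
    base = f"{registry}{component_id}/rdf"
    return f"{base}#{'.'.join(path)}" if path else base


root = component_uri("clarin.eu:cr1:c_1299509410079")
inner = component_uri(
    "clarin.eu:cr1:c_1299509410079",
    ("Parameter", "Values", "ParameterValue", "Description"),
)
print(root)
print(inner)
```

Profiles appear to live under a different registry path (the sample SPARQL queries use `.../registry/profiles/...`), which is why the base URI is kept as a parameter here.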

CR 2 RDF (cnt.)
[Diagram; labels: cmdm:Component, cmdm:Element, rdfs:subClassOf, dcr:datcat, isocat:DC-2520, cmd-c:Parameter, cmd-c:Parameter.Values, cmd-c:Parameter.Description, cmd-c:Parameter.Values.ParameterValue, cmd-c:Parameter.Values.ParameterValue.Description, cmd-c:Parameter.Values.ParameterValue.Value, cmd-c:hasParameter.Values.ParameterValue.hasValueElementValue, xsd:string]

CR 2 RDF (cnt.)
- If the value domain is an enumeration (like country code), there is an additional has...ElementEntity object property, which refers to the allowed values using their Component-based URI
- Entities can also have ConceptLinks, which can later be used for more extensive mappings
- Nesting of Components and Elements is represented in the instance only by the generic cmdm:contains property. Is a profile-specific subproperty missing? For example:
    cmd-c:Parameter.containsValues
        rdfs:subPropertyOf cmdm:contains ;
        rdfs:domain cmd-c:Parameter ;
        rdfs:range cmd-c:Parameter.Values .
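If such profile-specific subproperties were wanted, they could be generated mechanically from the component paths. The following Python sketch reproduces the Turtle of the slide's example; the helper name `contains_subproperty` is hypothetical, and applying the `contains<Child>` naming pattern to deeper nesting levels is an assumption extrapolated from that single example.

```python
def contains_subproperty(parent_path, child_name, prefix="cmd-c"):
    """Return Turtle declaring a profile-specific subproperty of cmdm:contains.

    `parent_path` is the list of names leading to the containing component,
    `child_name` the name of the nested component or element.
    """
    parent = ".".join(parent_path)
    child = f"{parent}.{child_name}"
    return (
        f"{prefix}:{parent}.contains{child_name}\n"
        f"    rdfs:subPropertyOf cmdm:contains ;\n"
        f"    rdfs:domain {prefix}:{parent} ;\n"
        f"    rdfs:range {prefix}:{child} .\n"
    )


turtle = contains_subproperty(["Parameter"], "Values")
print(turtle)
```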

CR 2 RDF (cnt.)
[Diagram; labels: cmdm:Element, cmdm:hasElementEntity, cmdm:hasElementValue, cmdm:Entity, cmdm:Value, rdfs:subPropertyOf, rdfs:subClassOf, cmd-c:ISO639.iso-639-1-code, cmd-c:ISO639.hasiso-639-1-codeElementEntity, cmd-c:ISO639.hasiso-639-1-codeElementValue, cmd-c:ISO639.iso-639-1-codeEntity, xsd:string, a, dcr:datcat, cmd-c:ISO639.iso-639-1-codeValue.aa, cdb:CDB-00130489-001]

CMD Record
A CMD record consists of:
- A header containing Dublin Core-like metadata
- A Resource section pointing to:
  - The resources being described
  - Other CMD Records (modelling a collection)
  - A landing page
  - A search page
- The Component section governed by the CMD Profile

Sample CMD record

Record 2 RDF
Overall structure:
- Components follow the CR2RDF structure of their profile and are the body of an Open Annotation
- The Open Annotation describes the resources (oa:hasTarget)
- Header elements become Dublin Core properties of the Component root
- Landing and search pages are properties of the Open Annotation
- When the CMD record represents a collection (i.e. references other CMD records), it is modelled as an ORE ResourceMap for these other records
- Every CMD record is wrapped into a separate graph, e.g.: http://www.clarin.eu/cmd/BAS_Repository/oai_BAS_repo_Corpora_aGender_100103.rdf
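The overall structure can be illustrated with a small Python sketch that lists the triples wrapping one record. It is not the CMD2RDF implementation: the `#annotation` and `#root` fragments, the example target URL and the use of `dcterms:creator` for a header element are all assumptions made for this example. Only `oa:hasTarget`, the Open Annotation body and the per-record graph come from the slide, and the graph URI assumes that the two parts of the example URI on the slide belong together.

```python
def record_to_triples(graph_uri, body_uri, targets, header=None):
    """Return (subject, predicate, object) triples for one CMD record.

    The component root (`body_uri`) is the body of an Open Annotation,
    the described resources are its targets, and header elements are
    attached to the component root as Dublin Core properties.
    """
    annotation = graph_uri + "#annotation"  # hypothetical naming scheme
    triples = [
        (annotation, "rdf:type", "oa:Annotation"),
        (annotation, "oa:hasBody", body_uri),
    ]
    triples.extend((annotation, "oa:hasTarget", target) for target in targets)
    for prop, value in (header or {}).items():
        triples.append((body_uri, "dcterms:" + prop, value))
    return triples


# Named graph of the record, taken from the example on the slide.
GRAPH = "http://www.clarin.eu/cmd/BAS_Repository/oai_BAS_repo_Corpora_aGender_100103.rdf"

triples = record_to_triples(
    GRAPH,
    body_uri=GRAPH + "#root",  # hypothetical URI of the component root
    targets=["http://example.org/resource/recording.wav"],  # placeholder resource
    header={"creator": "example creator"},  # placeholder header value
)
for triple in triples:
    print(triple)
```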

First tests
- A sample of ~14,000 CMD records from 18 different providers in 43 different profiles
- Uploaded to Virtuoso together with:
  - the basic model (cmdm)
  - CR2RDF (199 profiles and 877 components)
  - data category definitions and RR relation sets
- S(i)ample SPARQL queries:
  - basic facets: records / language, / profile
  - inspect the recursive cmdm:contains predicate
  - list existing organisation names (literals)
  - usage of data categories
  - search via data category (emulate VLO)
http://clarin.aac.ac.at/virtuoso/sparql

Future work
- Resolve literals to resource links (outbound links), i.e. has...ElementValue → has...ElementEntity
- Step-by-step for selected predicates:
  - Organisations → CLAVAS, ?
  - Persons → GND, VIAF, dbpedia
  - Languages → WALS.info, which allows asking for resources for languages with given phenomena (e.g. word order)
  - ...?
- A CLARIN-NL project to flesh out CMD2RDF has just started
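The literal-to-entity step could look roughly like the following Python sketch. It assumes a ready-made lookup table from literal values to entity URIs; in practice that table would come from a vocabulary service such as those named above, whose APIs are not shown here. The predicate and URIs in the example are hypothetical placeholders.

```python
def resolve_literals(triples, lookup):
    """Add a has...ElementEntity triple for each resolvable has...ElementValue literal.

    `lookup` maps literal values to entity URIs. The original value
    triples are kept, so unresolved literals are never lost.
    """
    resolved = list(triples)
    for subject, predicate, value in triples:
        if predicate.endswith("ElementValue") and value in lookup:
            entity_predicate = predicate[: -len("ElementValue")] + "ElementEntity"
            resolved.append((subject, entity_predicate, lookup[value]))
    return resolved


# Hypothetical data, for illustration only.
data = [
    ("ex:elem1", "ex:hasOrganisationElementValue", "Example Institute"),
    ("ex:elem2", "ex:hasOrganisationElementValue", "Unknown Org"),
]
lookup = {"Example Institute": "http://example.org/org/example-institute"}

result = resolve_literals(data, lookup)
for triple in result:
    print(triple)
```

Keeping the literal next to the resolved entity means the step can be rolled out predicate by predicate without touching records that cannot be resolved yet.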

CMD2RDF system architecture

Thanks for your attention! Questions? Now or
matej.durco@oeaw.ac.at
menzo.windhouwer@dans.knaw.nl

Sample SPARQL queries

Number of records per profile:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX cmdm: <http://www.clarin.eu/cmd/general.rdf#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT (SAMPLE(?p) AS ?profile) ?pid (COUNT(?i) AS ?count)
    WHERE {
      ?p rdfs:subClassOf cmdm:Profile .
      ?p dcterms:identifier ?pid .
      ?i a ?p .
    }
    GROUP BY ?p ?pid
    ORDER BY DESC(?count)

Element values reachable through the recursive cmdm:contains predicate within one profile:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX oa: <http://www.w3.org/ns/oa#>
    PREFIX cmdm: <http://www.clarin.eu/cmd/general.rdf#>
    SELECT ?elemtype ?value
    WHERE {
      ?rootcomponent a <http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1290431694579/rdf#LexicalResourceProfile> .
      ?rootcomponent cmdm:contains* ?comp .
      ?comp cmdm:contains ?elem .
      ?elem a ?elemtype .
      ?elem ?haselemvalue ?value .
      ?elemtype rdfs:subClassOf cmdm:Element .
      FILTER( isLiteral(?value) )
      FILTER( regex(?value, '.') )
    }
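A query like the first one can be sent to a SPARQL endpoint as a plain HTTP GET request with the query text in the `query` parameter, as defined by the SPARQL protocol. The Python sketch below only builds the request URL and performs no network access. The endpoint is the one named on the "First tests" slide and may no longer be reachable; the helper name `sparql_get_url` is an assumption made for this example.

```python
from urllib.parse import urlencode

# Endpoint named on the "First tests" slide; availability is not guaranteed.
ENDPOINT = "http://clarin.aac.ac.at/virtuoso/sparql"

QUERY = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cmdm: <http://www.clarin.eu/cmd/general.rdf#>
SELECT ?p (COUNT(?i) AS ?count)
WHERE {
  ?p rdfs:subClassOf cmdm:Profile .
  ?i a ?p .
}
GROUP BY ?p
ORDER BY DESC(?count)
"""


def sparql_get_url(endpoint, query):
    """Return the GET URL that submits `query` to a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with `urllib.request.urlopen`, typically with an `Accept: application/sparql-results+json` header to ask for JSON results.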

CMDM 2 RDF

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix cmdm: <http://www.clarin.eu/cmd/general.rdf#> .

    # basic building blocks of CMD Model
    cmdm:Component a rdfs:Class .
    cmdm:Profile rdfs:subClassOf cmdm:Component .
    cmdm:Element a rdfs:Class .

    # basic CMD nesting
    cmdm:contains a rdf:Property ;
        rdfs:domain cmdm:Component ;
        rdfs:range cmdm:Component , cmdm:Element .

    # values
    cmdm:Value a rdfs:Literal .
    cmdm:hasElementValue a rdf:Property ;
        rdfs:domain cmdm:Element ;
        rdfs:range cmdm:Value .

    # add a parallel separate class/property for the resolved entities
    cmdm:Entity a rdfs:Class .
    cmdm:hasElementEntity a rdf:Property ;
        rdfs:range cmdm:Entity .

CMDM 2 RDF (cnt.)

    # Attributes
    cmdm:Attribute a rdfs:Class .
    cmdm:containsAttribute a rdf:Property ;
        rdfs:domain cmdm:Component , cmdm:Element ;
        rdfs:range cmdm:Attribute .
    cmdm:hasAttributeValue a rdf:Property ;
        rdfs:domain cmdm:Attribute ;
        rdfs:range cmdm:Value .
    cmdm:hasAttributeEntity a rdf:Property ;
        rdfs:range cmdm:Entity .

CMDM 2 RDF (cnt.)

CR 2 RDF (cnt.)

    @prefix cmdm: <http://www.clarin.eu/cmd/general.rdf#> .
    @prefix isocat: <http://www.isocat.org/datcat/> .
    @prefix cmd-p: <http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1299509410079/rdf#> .

    cmd-p:Parameter rdfs:subClassOf cmdm:Component ;
        rdfs:label "Parameter" .
    cmd-p:Parameter.Description rdfs:subClassOf cmdm:Element ;
        rdfs:label "Description" ;
        dcr:datcat isocat:DC-2520 .
    cmd-p:Parameter.Values rdfs:subClassOf cmdm:Component .
    cmd-p:Parameter.Values.ParameterValue rdfs:subClassOf cmdm:Component .
    cmd-p:Parameter.Values.ParameterValue.Description rdfs:subClassOf cmdm:Element ;

CR 2 RDF (cnt.)

    cmd-p:Parameter.Values.ParameterValue.Value rdfs:subClassOf cmdm:Element .
    cmd-p:hasParameter.Values.ParameterValue.hasValueElementValue
        rdfs:subPropertyOf cmdm:hasElementValue ;
        rdfs:domain cmd-p:Parameter.Values.ParameterValue.Value ;
        rdfs:range xsd:string .

- If the value domain is an enumeration, there is an additional has...ElementEntity property whose range is a Class of which each value (which gets a Component-based URI) is a subclass
- Entities can also have ConceptLinks, which can later be used for more extensive mappings
- Missing? Nesting of Components and Elements is represented only by the generic cmdm:contains property

    cmd-p:Parameter.containsValues
        rdfs:subPropertyOf cmdm:contains ;
        rdfs:domain cmd-p:Parameter ;
        rdfs:range cmd-p:Parameter.Values .