2013-05-17 - Utrecht Matej Ďurčo, ICLTT, Vienna Controlled Vocabularies and SMC4LRT Semantic Mapping in CMDI.


Presentation transcript:


2 Context
Activities:
● CLARIN taskforce within SCCTC, building on CLAVAS (Vocabulary Alignment Service for CLARIN)
● DARIAH joint taskforce of VCC1/Task 5 (Data federation and interoperability) and VCC3/Task 3 (Reference Data Registries), and external partners
goal: establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community.
SMC – Semantic Mapping Component: a module in the CMD-Infrastructure
goal: "semantic search" = enhance the search in the heterogeneous data collection (of CMDI)
a) by exploiting the shared data categories (SMC on schema level)
b) by expressing the data in RDF (SMC on instance level)

3 Context II - CLARIN-AT
CCV – CLARIN Center Vienna
CenterProfile CMD record
expected ready by:
Infrastructure services:
● CLARIN Metadata Repository
● SMC – Semantic Mapping Component
● SMC-Browser
● Controlled Vocabularies
engagement in CLARIN + DARIAH task forces

Old vision conceptualization sketch from

5 Potential usages for CV
● Metadata Generation, Curation
● Data-Enrichment / Annotation
● Data Analysis
● Search (Query Expansion, autocomplete, facets etc.)
● needed for CMD2RDF
  - provide identifiers for entities
  - provide equivalencies between concepts/entities from different vocabularies (concept schemes), like the equivalencies in Wikipedia (page for Johann Wolfgang Goethe): GND: | LCCN: n | NDL: | VIAF:

6 Related Activities
● DARIAH Schema Registry + Crosswalk Registry
● full-blown ontology with People, Projects, Organisations, Events, LR integration would have to happen at another level (RDF/LOD).
● CoNE – Control of Named
● EATS - Entity Authority Tool Zealand Electronic Text Centre (NZETC).
● TextGrid
● FRBR - Functional Requirements for Bibliographic Records
● RDA - Resource Description and Access
● Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)

7 Candidate Vocabularies
● Data Categories / Concepts - ISOcat
● Languages - ISO-639
● Countries - country codes
● Persons - GND, VIAF, dbpedia?
● Organizations - GND, VIAF, dbpedia?
● Schlagwörter/Subjects - GND, LCSH
● Resource Typology -
● Tagsets!? (with mappings between tags)
AAT - Art & Architecture Thesaurus (international)
GND - Gemeinsame Normdatei (DNB)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File

8 ISOcat and CLAVAS
export closed+simple DCs (perhaps even better to manually select)
Third party applications use
- ISOcat for explain() function
- CLAVAS for value(/entity)-lists

9 informed query input
Information about the available data categories and the values for those categories can be used as the base for a complex query-input widget with context-sensitive autocomplete.
However, this serves rather only as a fallback to autocomplete based on actual data.
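The fallback order described on this slide could be sketched roughly as follows. This is a minimal illustration, not the SMC implementation: the function name `suggest`, its parameters and the sample values are all assumptions made for the example.

```python
def suggest(prefix, category, data_values, vocabulary_values, limit=5):
    """Return completions for `prefix` within one data category.

    Values observed in the actual data are tried first; the controlled
    vocabulary is consulted only when the data yields no match.
    """
    needle = prefix.casefold()

    def matches(values):
        found = sorted(
            v for v in values.get(category, ()) if v.casefold().startswith(needle)
        )
        return found[:limit]

    return matches(data_values) or matches(vocabulary_values)


# Hypothetical sample values, not taken from any real registry.
data_values = {"languageName": ["German", "Georgian"]}
vocabulary_values = {"languageName": ["German", "Georgian", "Greek", "Dutch"]}

print(suggest("ge", "languageName", data_values, vocabulary_values))
print(suggest("gr", "languageName", data_values, vocabulary_values))
```

The first call is answered from the data alone; the second finds nothing in the data and falls back to the vocabulary.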

10 CMD → RDF
Semantic Mapping on instance level
express MD records in RDF (for LOD) => bind also values in MD fields to concepts
Modelling aspects
● CMD Specification
● Data Categories
● CMD instances:
  - Identifier, Provenance, Hierarchy,
  - Components, Elements,
  - Values, Literal Values, Mapping to entities – Vocabularies => CLAVAS
● Ontological Relations
used namespaces
Prefix name | Prefix IRI
rdf: rdfs: xsd: owl: skos: isocat: dcr: cmd: cmd_spec:? dce: dcterms: oa: ore: cr:
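A minimal sketch of what "express MD records in RDF and bind values to concepts" might look like, using plain Python and N-Triples output. The namespace IRIs under example.org, the `...Entity` predicate naming and the sample record are assumptions for illustration only; the slide does not give the actual prefix IRIs or the CMD2RDF modelling.

```python
CMD = "http://example.org/cmd/"        # hypothetical namespace IRI
ENTITY = "http://example.org/entity/"  # hypothetical vocabulary namespace


def record_to_triples(record_id, fields, entity_for):
    """Express one flat MD record as (subject, predicate, object) triples.

    Every field keeps its literal value; when `entity_for` knows a concept
    for the (field, value) pair, an additional triple links to that concept.
    """
    subject = f"<{CMD}record/{record_id}>"
    triples = []
    for field, value in fields.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, f"<{CMD}{field}>", f'"{escaped}"'))
        entity = entity_for.get((field, value))
        if entity is not None:
            # Assumed convention: "<field>Entity" carries the concept link.
            triples.append((subject, f"<{CMD}{field}Entity>", f"<{ENTITY}{entity}>"))
    return triples


def to_ntriples(triples):
    """Serialize triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical record and value-to-entity mapping.
fields = {"Title": "Sample corpus", "Publisher": "Example Institute"}
entity_for = {("Publisher", "Example Institute"): "org-1"}
triples = record_to_triples("r1", fields, entity_for)
print(to_ntriples(triples))
```

Keeping the literal next to the entity link means the original field value is not lost when a mapping later turns out to be wrong.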

11 Approach – Individuals/Instance Level
One step when (pre)processing incoming new MD-sets
1. Express MD-Records as RDF-triples:
2. Identify potential target Domain Ontologies/Vocabularies
3. Create inverted Index:
4. Define lookup function:
5. Enrich dataset with new facts:
6. Property-values of Metadata-Records are linked to individuals of domain ontologies
lookup(category, string-value) → label → entity
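Steps 3 to 5 above can be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: the vocabulary entry layout, the normalization rule, the decision to accept only unambiguous matches, and all identifiers such as `org:1` are hypothetical.

```python
from collections import defaultdict


def normalize(label):
    """Case-fold and collapse whitespace so label variants compare equal."""
    return " ".join(label.casefold().split())


def build_inverted_index(vocabulary):
    """Map (category, normalized label) -> set of entity identifiers.

    `vocabulary` is an iterable of (category, entity_id, labels) entries.
    """
    index = defaultdict(set)
    for category, entity_id, labels in vocabulary:
        for label in labels:
            index[(category, normalize(label))].add(entity_id)
    return index


def lookup(index, category, string_value):
    """lookup(category, string-value) -> label -> entity, sorted for stable output."""
    return sorted(index.get((category, normalize(string_value)), ()))


def enrich(records, index, categories):
    """Yield (record_id, category, entity_id) facts for unambiguous matches only."""
    for record_id, fields in records:
        for category in categories:
            value = fields.get(category)
            if value is None:
                continue
            entities = lookup(index, category, value)
            if len(entities) == 1:
                yield (record_id, category, entities[0])


# Hypothetical vocabulary entries and metadata records.
vocabulary = [
    ("Institution", "org:1", ["Example Institute", "EI"]),
    ("Institution", "org:2", ["Another Institute"]),
]
index = build_inverted_index(vocabulary)
records = [
    ("r1", {"Institution": "  example   institute "}),
    ("r2", {"Institution": "Unknown"}),
]
facts = list(enrich(records, index, ["Institution"]))
print(facts)
```

Keying the index by category as well as label keeps a string such as a person's name from being matched against an organisation vocabulary.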

12 Candidate Categories/Properties
ResourceType, Format, AnnotationLevelType
→ map to: isocat-DataCategories (Profiles: Metadata, Morphosyntax, ...)
Genre, Topic, Subject
→ map to: Taxonomies, Library Classification systems (LCSH, DDC, Dornseiff, ...)
Project, Institution, Person, Publisher: open controlled vocabularies (real entities)
→ map to: CLAVAS-organisations, LT-World (perhaps others: LCCN, DBPedia?)

13 Next Steps
● Install current OpenSKOS at CCV – CLARIN Center Vienna
● synchronize 3 current datasets via OAI-PMH with sister instance at Meertens, also to test the synchronization process (and implications)
● CMD2RDF
● "special groups vocabularies" in CLARIN-AT
  - Plant names
  - Instruments

14 Appendix
Explanations to SMC and CMDI

15 Semantic Mapping (schema level) - concept
metadata fields in (completely) different profiles but bound to (the same) data categories (ConceptLinks)
use this linkage when searching in the data, i.e. allow the user to search
a) "in the data category"
b) in a MD field, but also in all related fields from other profiles
Multiple mapping levels:
1. just mapping based on the ConceptLink, resolvable via ComponentRegistry: different elements pointing to the same DatCat
2. use equivalence relations between DatCats from Relation Registry
3. use equivalence relations also between Container DatCats
4. use also other relations in Relation Registry (subClassOf, almostSameAs, …)
5. apply selected (user defined) relation sets from Relation Registry
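The mapping levels listed above can be read as a progressively wider graph traversal over data categories. The sketch below illustrates levels 1, 2 and 4 under several assumptions: the relation name `sameAs`, the identifier `dc:title`, the path `OLAC-DcmiTerms.title` and the rule that only equivalence-like relations are followed in both directions are all hypothetical, not taken from the Relation Registry.

```python
SYMMETRIC = {"sameAs", "almostSameAs"}  # "sameAs" is an assumed relation name


def related_datcats(start, relations, allowed_types):
    """Collect the data categories reachable from `start`.

    `relations` holds (source, relation_type, target) triples. Only types in
    `allowed_types` are followed; symmetric types are followed both ways.
    """
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for source, rel_type, target in relations:
            if rel_type not in allowed_types:
                continue
            if source == current:
                neighbour = target
            elif target == current and rel_type in SYMMETRIC:
                neighbour = source
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen


def fields_for(datcat, concept_links, relations, allowed_types=()):
    """Return sorted CMD field paths bound to `datcat` or a related data category."""
    datcats = related_datcats(datcat, relations, set(allowed_types))
    return sorted(path for path, dc in concept_links.items() if dc in datcats)


# ConceptLinks: field path -> data category (third entry is hypothetical).
concept_links = {
    "BamdesCommonFields.resourceTitle": "isocat:DC-2545",
    "imdi-corpus.Corpus.Title": "isocat:DC-2545",
    "OLAC-DcmiTerms.title": "dc:title",
}
relations = [("isocat:DC-2545", "sameAs", "dc:title")]  # hypothetical relation

print(fields_for("isocat:DC-2545", concept_links, relations))              # level 1
print(fields_for("isocat:DC-2545", concept_links, relations, {"sameAs"}))  # level 2
```

With an empty set of allowed relation types only the ConceptLinks count (level 1); passing further types, or a user-selected relation set, widens the result.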

16 CMDI linking
● components and elements in CMD profiles are bound to data categories
● the CMD records reference their profiles
● in Relation Registry data categories are related to each other in separate (possibly overlapping/contradicting) relation sets

17 Semantic Mapping Component
● separate CMDI module
● relies on information from ComponentRegistry, DCR, RelationRegistry
● is used by Metadata Repository / Service / Browser
Task:
- resolution: dcrIndex ↔ cmdIndex
  dcrIndex :: (abstract) data category defined in DCR
  cmdIndex :: path to a field in a MDRecord (different from
- query expansion: CQL(datcat) → CQL(cmdIndex[])
- query translation: e.g. CQL → XPath
Input | Output
dcrIndex isocat.DC-2545 (= isocat.resourceTitle) => cmdIndex[] [BamdesCommonFields.resourceTitle, imdi-corpus.Corpus.Title, …]
cmdIndex Actor.Role => dcrIndex isocat:DC-2559 (participantRole)
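The resolution and query-expansion tasks might be sketched like this. The mapping table reuses the two examples from the slide; the function names, the handling of a single clause only, and the way the expanded clause is assembled are assumptions for illustration, not the SMC's actual interface.

```python
def expand_clause(index, relation, term, dcr_to_cmd):
    """Rewrite one CQL search clause on a dcrIndex into an OR over cmdIndexes.

    Clauses on indexes missing from `dcr_to_cmd` are returned unchanged.
    """
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    paths = dcr_to_cmd.get(index)
    if not paths:
        return f'{index} {relation} "{escaped}"'
    clauses = [f'{path} {relation} "{escaped}"' for path in paths]
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " OR ".join(clauses) + ")"


def invert(dcr_to_cmd):
    """Build the reverse resolution cmdIndex -> dcrIndex."""
    return {path: dcr for dcr, paths in dcr_to_cmd.items() for path in paths}


# Resolution table built from the two examples on the slide.
dcr_to_cmd = {
    "isocat.DC-2545": ["BamdesCommonFields.resourceTitle", "imdi-corpus.Corpus.Title"],
    "isocat:DC-2559": ["Actor.Role"],
}

print(expand_clause("isocat.DC-2545", "=", "corpus", dcr_to_cmd))
print(invert(dcr_to_cmd)["Actor.Role"])
```

A complete component would parse the whole CQL query and expand each clause in place; the single-clause version here only shows the dcrIndex to cmdIndex[] step.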

18 Examples of DCR use in CMD metadata
resourceName isocat:DC
- CorpusProfile.Corpus.Metadata.Name
- CorpusProfile.Corpus.SourceList.Source.Name
- collection.GeneralInfo.Name
- Session.Name
- imdi-corpus.Corpus.Name
- ToolService.GeneralInfo.Name
- GTRP.Collection.GeneralInfo.Name
- DIDDD.Collection.GeneralInfo.Name
- Soundbites.Collection.GeneralInfo.Name
- DynaSAND.Collection.GeneralInfo.Name
BUT: CMD Element: "Name" …
CMD Element name | distinct Elems | distinct DatCats
Name | 40 | 11
Type | 16 | 8
Title | 14 | 6
Language | 10 | 6
ID | 11 | 5
format | 10 | 5
identifier | 6 | 5
Description | 31 | 4
Code | 8 | 4
date | 12 | 4
publisher | 9 | 4
source | 10 | 4
subject | 6 | 4
Creator | 6 | 3
Address | 5 | 3
Organisation | 3 | 3
Availability | 6 | 3
datatype | 8 | 3
contributor | 4 | 3

19 Examples of DCR use in CMD metadata II
languageID isocat:DC-2482
- LrtInventoryResource.LrtCommon.Languages.ISO639.iso code
- Session.MDGroup.Content.Content_Languages.Content_Language.Id
- Session.MDGroup.Actors.Actor.Actor_Languages.Actor_Language.Id
- Session.Resources.WrittenResource.LanguageId
- ToolService.Documentation.DocumentationLanguages.Language.ISO639.iso code
- ToolService.Tool.Documentation.DocumentationLanguages.Language.ISO639.iso code
- GTRP.Collection.DocumentationLanguages.Language.ISO639.iso code
- DIDDD.Collection.DocumentationLanguages.Language.ISO639.iso code
- DynaSAND.Collection.DocumentationLanguages.Language.ISO639.iso code
languageName isocat:DC-2484
- ToolService.Documentation.DocumentationLanguages.Language.LanguageName
- ToolService.Tool.Documentation.DocumentationLanguages.Language.LanguageName
- GTRP.Collection.DocumentationLanguages.Language.LanguageName
- DIDDD.Collection.DocumentationLanguages.Language.LanguageName
- DynaSAND.Collection.DocumentationLanguages.Language.LanguageName
dct:language
- OLAC-DcmiTerms.language
metadataLanguage isocat:DC-2543
- CorpusProfile.Corpus.Metadata
dominantLanguage isocat:DC-2468
- Session.MDGroup.Content.Content_Languages.Content_Language.Dominant
sourceLanguage isocat:DC-2494
- Session.MDGroup.Content.Content_Languages.Content_Language.SourceLanguage
targetLanguage isocat:DC-2499
- Session.MDGroup.Content.Content_Languages.Content_Language.TargetLanguage
implementationLanguage isocat:DC
- ToolService.Tool.Implementation.implementationLanguage

20 DCR usage in Component Registry
Datcats in CompReg | 288
  ISOcat | 164
  dc-elems | 15
  dc-terms | 55
  private ISOcat DatCats (?) | 54
Elements with Datcats | 82,38%
Components with Datcats | 67

Data Categories Sets | 827
  isocat (Metadata Profile#5) | 712
  dublincore elements | 16
  dublincore terms | 99

Component Registry
CMD-Profiles | 53
standalone Components | 235*)
overall Components | 298
distinct Elements | 893
all Elements | 3.030
all paths (profile/comp/elem) | 4.565
Components structure as of

21 SMC Browser
Explore the Component Metadata Framework
● Profile specifications from Component Registry visualized as interactive graphs
● statistics (about reuse of Components)
TODO
● feed with statistics of the instance data
● add relations from RELcat
● add operations on graphs (intersection, difference, …)

22 SMC Browser
Explore the Component Metadata Framework