Controlled Vocabularies and SMC4LRT
Semantic Mapping in CMDI
Matej Ďurčo, ICLTT, Vienna
Utrecht
Context
Activities:
● CLARIN taskforce (within SCCTC), building on CLAVAS, the Vocabulary Alignment Service for CLARIN
● DARIAH joint taskforce of VCC1/Task 5 (Data federation and interoperability) and VCC3/Task 3 (Reference Data Registries), and external partners
  goal: establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community
● SMC, the Semantic Mapping Component: a module in the CMD Infrastructure
  goal: "semantic search" = enhance search in the heterogeneous data collection (of CMDI)
  a) by exploiting the shared data categories (SMC on schema level)
  b) by expressing the data in RDF (SMC on instance level)
Context II - CLARIN-AT
CCV: CLARIN Center Vienna
● CenterProfile CMD record, expected ready by:
● Infrastructure services:
  - CLARIN Metadata Repository
  - SMC, the Semantic Mapping Component
  - SMC-Browser
  - Controlled Vocabularies
● engagement in CLARIN + DARIAH task forces
Old vision conceptualization sketch from
Potential usages for CV
● Metadata Generation, Curation
● Data-Enrichment / Annotation
● Data Analysis
● Search (query expansion, autocomplete, facets etc.)
● needed for CMD2RDF
  - provide identifiers for entities
  - provide equivalencies between concepts/entities from different vocabularies (concept schemes), possibly like the equivalencies in Wikipedia (page for Johann Wolfgang Goethe): GND: | LCCN: n | NDL: | VIAF:
Related Activities
● DARIAH Schema Registry + Crosswalk Registry
● full-blown ontology with People, Projects, Organisations, Events, LR; integration would have to happen at another level (RDF/LOD)
● CoNE - Control of Named Entities
● EATS - Entity Authority Tool Set, New Zealand Electronic Text Centre (NZETC)
● TextGrid
● FRBR - Functional Requirements for Bibliographic Records
● RDA - Resource Description and Access
● Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
Candidate Vocabularies
● Data Categories / Concepts - ISOcat
● Languages - ISO-639
● Countries - country codes
● Persons - GND, VIAF, dbpedia?
● Organizations - GND, VIAF, dbpedia?
● Subject headings (Schlagwörter) - GND, LCSH
● Resource Typology -
● Tagsets!? (with mappings between tags)

Abbreviations:
AAT - Art & Architecture Thesaurus
GND - Gemeinsame Normdatei (DNB)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File
ISOcat and CLAVAS
● export closed+simple DCs (perhaps even better to select them manually)
● Third-party applications use
  - ISOcat for the explain() function
  - CLAVAS for value (/entity) lists
Informed query input
● Information about the available data categories, and about the values for those categories, can serve as the base for a complex query-input widget with context-sensitive autocomplete.
● However, this is rather only a fallback to autocomplete based on the actual data.
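The fallback idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `suggest`, the two dictionaries and all sample values are hypothetical and not part of any CLARIN service.

```python
from typing import Dict, List


def suggest(category: str, prefix: str,
            data_values: Dict[str, List[str]],
            vocab_values: Dict[str, List[str]],
            limit: int = 5) -> List[str]:
    """Return completions for `prefix` within one data category.

    Values seen in the actual data are preferred; the controlled
    vocabulary is consulted only when the data yields no match.
    """
    needle = prefix.casefold()

    def matches(source: Dict[str, List[str]]) -> List[str]:
        return sorted(v for v in source.get(category, [])
                      if v.casefold().startswith(needle))

    hits = matches(data_values) or matches(vocab_values)
    return hits[:limit]


# Hypothetical sample values (assumptions, not real CLARIN data).
data_values = {"languageName": ["German", "Greek"]}
vocab_values = {"languageName": ["Dutch", "Georgian", "German", "Greek"],
                "country": ["Austria", "Australia"]}

print(suggest("languageName", "ge", data_values, vocab_values))
print(suggest("country", "au", data_values, vocab_values))
```

The first call is answered from the data alone; the second falls back to the vocabulary because no country values occur in the sample data.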
CMD RDF
Semantic Mapping on instance level
express MD records in RDF (for LOD) => also bind values in MD fields to concepts
Modelling aspects
● CMD Specification
● Data Categories
● CMD instances:
  - Identifier, Provenance, Hierarchy
  - Components, Elements
  - Values: literal values, mapping to entities (vocabularies => CLAVAS)
● Ontological Relations
Used namespaces (prefix names): rdf:, rdfs:, xsd:, owl:, skos:, isocat:, dcr:, cmd:, cmd_spec:?, dce:, dcterms:, oa:, ore:, cr:
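A minimal sketch of how a nested metadata record could be flattened into triples while preserving the component hierarchy. The namespace IRIs (`example.org`), the function `record_to_triples` and the sample record are placeholders chosen for illustration; they are not the namespaces or the model used by CMD2RDF.

```python
from typing import Dict, Iterator, Tuple, Union

# Placeholder namespaces: assumptions for illustration only.
CMD = "http://example.org/cmd#"
REC = "http://example.org/record/"

Value = Union[str, Dict[str, "Value"]]
Triple = Tuple[str, str, str]


def record_to_triples(record_id: str, record: Dict[str, Value]) -> Iterator[Triple]:
    """Flatten a nested record into (subject, predicate, object) triples.

    Nested components become intermediate nodes whose IRI encodes the
    path, so the hierarchy survives; leaf values are emitted as quoted
    literals (no escaping is done in this sketch).
    """
    def walk(subject: str, node: Dict[str, Value]) -> Iterator[Triple]:
        for name, value in node.items():
            predicate = CMD + name
            if isinstance(value, dict):
                child = f"{subject}/{name}"
                yield (subject, predicate, child)
                yield from walk(child, value)
            else:
                yield (subject, predicate, f'"{value}"')

    yield from walk(REC + record_id, record)


# Hypothetical record with one component and two elements.
sample = {"Corpus": {"Title": "Sample corpus", "Language": "deu"}}
triples = list(record_to_triples("r1", sample))
for triple in triples:
    print(triple)
```

Binding the literal values to vocabulary concepts would then be a separate enrichment pass over these triples.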
Approach - Individuals/Instance Level
One step when (pre)processing incoming new MD-sets:
1. Express MD-Records as RDF-triples:
2. Identify potential target Domain Ontologies/Vocabularies
3. Create inverted Index:
4. Define lookup function:
5. Enrich dataset with new facts:
6. Property-values of Metadata-Records are linked to individuals of domain ontologies
lookup(category, string-value) → label → entity
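Steps 3 to 5 above can be sketched as follows. This is an illustrative sketch, not the SMC implementation: the vocabulary entries, the entity IRIs, the predicate naming and the helper names `normalize`, `build_index` and `enrich` are assumptions; only the `lookup(category, string-value)` signature comes from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (category, entity IRI, labels): hypothetical vocabulary entries.
VOCABULARY = [
    ("Organisation", "http://example.org/org/1",
     ["Meertens Institute", "Meertens Instituut"]),
    ("Organisation", "http://example.org/org/2", ["ICLTT"]),
    ("Language", "http://example.org/lang/deu", ["German", "Deutsch"]),
]

Index = Dict[Tuple[str, str], Set[str]]


def normalize(label: str) -> str:
    """Case-fold and collapse whitespace so label variants compare equal."""
    return " ".join(label.casefold().split())


def build_index(vocabulary: Iterable[Tuple[str, str, List[str]]]) -> Index:
    """Inverted index: (category, normalized label) -> set of entity IRIs."""
    index: Index = defaultdict(set)
    for category, entity, labels in vocabulary:
        for label in labels:
            index[(category, normalize(label))].add(entity)
    return index


def lookup(index: Index, category: str, value: str) -> List[str]:
    """lookup(category, string-value) -> label -> entity."""
    return sorted(index.get((category, normalize(value)), ()))


def enrich(index: Index, record_iri: str, category: str,
           value: str) -> List[Tuple[str, str, str]]:
    """Return new triples linking a record to every matched entity."""
    predicate = f"http://example.org/cmd#{category}Entity"  # assumed name
    return [(record_iri, predicate, entity)
            for entity in lookup(index, category, value)]


index = build_index(VOCABULARY)
print(lookup(index, "Organisation", "  meertens   instituut "))
print(enrich(index, "http://example.org/record/r1", "Language", "Deutsch"))
```

Keying the index on the category as well as the label keeps a string such as a person's name from matching an organisation that happens to share it.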
Candidate Categories/Properties
● ResourceType, Format, AnnotationLevelType
  → map to: isocat DataCategories (Profiles: Metadata, Morphosyntax, ...)
● Genre, Topic, Subject
  → map to: taxonomies, library classification systems (LCSH, DDC, Dornseiff, ...)
● Project, Institution, Person, Publisher: open controlled vocabularies (real entities)
  → map to: CLAVAS organisations, LT-World (perhaps others: LCCN, DBpedia?)
Next Steps
● Install current OpenSKOS at CCV (CLARIN Center Vienna)
● Synchronize the 3 current datasets via OAI-PMH with the sister instance at Meertens, also to test the synchronization process (and its implications)
● CMD2RDF
● "Special groups vocabularies" in CLARIN-AT
  - Plant names
  - Instruments
Appendix
Explanations of SMC and CMDI
Semantic Mapping (schema level) - concept
Metadata fields in (completely) different profiles can be bound to (the same) data categories (ConceptLinks). Use this linkage when searching in the data, i.e. allow the user to search
a) "in the data category", or
b) in a MD field, but also in all related fields from other profiles.
Multiple mapping levels:
1. just mapping based on the ConceptLink, resolvable via ComponentRegistry: different elements pointing to the same DatCat
2. use equivalence relations between DatCats from Relation Registry
3. use equivalence relations also between Container DatCats
4. use also other relations in Relation Registry (subClassOf, almostSameAs, …)
5. apply selected (user defined) relation sets from Relation Registry
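The difference between mapping level 1 and the relation-based levels can be sketched as follows. The field paths and data category identifiers are taken from the examples in this deck, but the relations between them are invented for illustration, as are the names `CONCEPT_LINKS`, `RELATIONS`, `related_datcats` and `expand_field`. The sketch also treats every relation as symmetric, which would not be appropriate for directed relations such as subClassOf.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Field path -> data category (ConceptLink), as in the Component Registry.
CONCEPT_LINKS: Dict[str, str] = {
    "Session.MDGroup.Content.Content_Languages.Content_Language.Id": "isocat:DC-2482",
    "Session.Resources.WrittenResource.LanguageId": "isocat:DC-2482",
    "ToolService.Documentation.DocumentationLanguages.Language.LanguageName": "isocat:DC-2484",
    "OLAC-DcmiTerms.language": "dct:language",
}

# Hypothetical relation set; a Relation Registry would supply real ones.
RELATIONS: List[Tuple[str, str, str]] = [
    ("isocat:DC-2482", "sameAs", "dct:language"),
    ("isocat:DC-2482", "almostSameAs", "isocat:DC-2484"),
]


def related_datcats(datcat: str, relation_types: FrozenSet[str]) -> Set[str]:
    """Data categories reachable via the allowed relation types."""
    seen = {datcat}
    frontier = [datcat]
    while frontier:
        current = frontier.pop()
        for left, relation, right in RELATIONS:
            if relation not in relation_types:
                continue
            # Simplification: relations are followed in both directions.
            for source, target in ((left, right), (right, left)):
                if source == current and target not in seen:
                    seen.add(target)
                    frontier.append(target)
    return seen


def expand_field(path: str,
                 relation_types: FrozenSet[str] = frozenset()) -> List[str]:
    """All fields to search when the user queries the field `path`."""
    targets = related_datcats(CONCEPT_LINKS[path], relation_types)
    return sorted(p for p, dc in CONCEPT_LINKS.items() if dc in targets)


field = "Session.Resources.WrittenResource.LanguageId"
print(expand_field(field))                          # level 1
print(expand_field(field, frozenset({"sameAs"})))   # level 2
```

With no relation types the expansion stays within one data category; each additional relation type widens the set of fields searched, which is the trade-off between recall and precision that the user-defined relation sets of level 5 are meant to control.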
CMDI linking
● components and elements in CMD profiles are bound to data categories
● the CMD records reference their profiles
● in Relation Registry, data categories are related to each other in separate (possibly overlapping/contradicting) relation sets
Semantic Mapping Component
● separate CMDI module
● relies on information from ComponentRegistry, DCR, RelationRegistry
● is used by Metadata Repository / Service / Browser
Task:
- resolution: dcrIndex ↔ cmdIndex
    dcrIndex :: (abstract) data category defined in DCR
    cmdIndex :: path to a field in a MDRecord (different from
- query expansion: CQL(datcat) → CQL(cmdIndex[])
- query translation: e.g. CQL → XPath

Input                                                 =>  Output
dcrIndex   isocat.DC-2545 (= isocat.resourceTitle)    =>  cmdIndex[]  [BamdesCommonFields.resourceTitle, imdi-corpus.Corpus.Title, …]
cmdIndex   Actor.Role                                 =>  dcrIndex    isocat:DC-2559 (participantRole)
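The resolution and query expansion tasks can be sketched as follows. The two mapping entries repeat the Input/Output examples above (only the two paths listed there are included); the table names and the function `expand_clause` are hypothetical, and the sketch ignores any quoting that real CQL index names might require.

```python
from typing import Dict, List

# dcrIndex -> cmdIndex[]; entries copied from the examples above.
DCR_TO_CMD: Dict[str, List[str]] = {
    "isocat.DC-2545": ["BamdesCommonFields.resourceTitle",
                       "imdi-corpus.Corpus.Title"],
    "isocat:DC-2559": ["Actor.Role"],
}

# Reverse direction: cmdIndex -> dcrIndex.
CMD_TO_DCR: Dict[str, str] = {
    path: datcat for datcat, paths in DCR_TO_CMD.items() for path in paths
}


def expand_clause(index: str, relation: str, term: str) -> str:
    """Rewrite one CQL search clause on a data category into a
    disjunction of clauses on the concrete metadata fields.

    An index that is not a known data category is passed through as is.
    """
    paths = DCR_TO_CMD.get(index, [index])
    quoted = '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'
    clauses = [f"{path} {relation} {quoted}" for path in paths]
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " or ".join(clauses) + ")"


print(expand_clause("isocat.DC-2545", "any", "folk tales"))
print(CMD_TO_DCR["Actor.Role"])
```

Query translation (e.g. CQL → XPath) would be a further step applied to the expanded clause and is not covered here.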
Examples of DCR use in CMD metadata
resourceName isocat:DC
  - CorpusProfile.Corpus.Metadata.Name
  - CorpusProfile.Corpus.SourceList.Source.Name
  - collection.GeneralInfo.Name
  - Session.Name
  - imdi-corpus.Corpus.Name
  - ToolService.GeneralInfo.Name
  - GTRP.Collection.GeneralInfo.Name
  - DIDDD.Collection.GeneralInfo.Name
  - Soundbites.Collection.GeneralInfo.Name
  - DynaSAND.Collection.GeneralInfo.Name

BUT: CMD Element: "Name" …

CMD Element name   |distinct Elems|   |distinct DatCats|
Name               40                 11
Type               16                 8
Title              14                 6
Language           10                 6
ID                 11                 5
format             10                 5
identifier         6                  5
Description        31                 4
Code               8                  4
date               12                 4
publisher          9                  4
source             10                 4
subject            6                  4
Creator            6                  3
Address            5                  3
Organisation       3                  3
Availability       6                  3
datatype           8                  3
contributor        4                  3
Examples of DCR use in CMD metadata II
languageID isocat:DC-2482
  - LrtInventoryResource.LrtCommon.Languages.ISO639.iso code
  - Session.MDGroup.Content.Content_Languages.Content_Language.Id
  - Session.MDGroup.Actors.Actor.Actor_Languages.Actor_Language.Id
  - Session.Resources.WrittenResource.LanguageId
  - ToolService.Documentation.DocumentationLanguages.Language.ISO639.iso code
  - ToolService.Tool.Documentation.DocumentationLanguages.Language.ISO639.iso code
  - GTRP.Collection.DocumentationLanguages.Language.ISO639.iso code
  - DIDDD.Collection.DocumentationLanguages.Language.ISO639.iso code
  - DynaSAND.Collection.DocumentationLanguages.Language.ISO639.iso code
languageName isocat:DC-2484
  - ToolService.Documentation.DocumentationLanguages.Language.LanguageName
  - ToolService.Tool.Documentation.DocumentationLanguages.Language.LanguageName
  - GTRP.Collection.DocumentationLanguages.Language.LanguageName
  - DIDDD.Collection.DocumentationLanguages.Language.LanguageName
  - DynaSAND.Collection.DocumentationLanguages.Language.LanguageName
dct:language
  - OLAC-DcmiTerms.language
metadataLanguage isocat:DC-2543
  - CorpusProfile.Corpus.Metadata
dominantLanguage isocat:DC-2468
  - Session.MDGroup.Content.Content_Languages.Content_Language.Dominant
sourceLanguage isocat:DC-2494
  - Session.MDGroup.Content.Content_Languages.Content_Language.SourceLanguage
targetLanguage isocat:DC-2499
  - Session.MDGroup.Content.Content_Languages.Content_Language.TargetLanguage
implementationLanguage isocat:DC
  - ToolService.Tool.Implementation.implementationLanguage
DCR usage in Component Registry

Datcats in CompReg                  288
  ISOcat                            164
  dc-elems                          15
  dc-terms                          55
  private ISOcat DatCats (?)        54
Elements with Datcats               82,38%
Components with Datcats             67

Data Categories Sets                827
  isocat (Metadata Profile#5)       712
  dublincore elements               16
  dublincore terms                  99

Component Registry
  CMD-Profiles                      53
  standalone Components             235*)
  overall Components                298
  distinct Elements                 893
  all Elements                      3.030
  all paths (profile/comp/elem)     4.565

Components structure as of
SMC Browser
Explore the Component Metadata Framework
● Profile specifications from Component Registry visualized as interactive graphs
● statistics (about reuse of Components)
TODO
● feed with statistics of the instance data
● add relations from RELcat
● add operations on graphs (intersection, difference, …)