STITCH project CATCH User Group January 30th 2007
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain
Summary Global presentation of project and past work General motivations Pilot project Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain
Motivation Current CH trend: portals that build on heterogeneous collections Different databases Documents described/accessed according to different points of view (controlled vocabularies/MD schemes)
CH Interoperability Problems Current CH trend: portals that build on heterogeneous collections Different databases/vocabularies/MD schemes Syntactic interoperability problem is being solved Access can be granted, cf. deployed portals Semantic interoperability still to be addressed Links with original vocabularies/MD structures are lost
STITCH General Goals [SemanTic Interoperability To access Cultural Heritage] Allow heterogeneous CH collections to be accessed In a seamless way Still benefiting from specific collection commitments Keeping original metadata schemes and vocabularies
STITCH General Goals (2) Allow heterogeneous CH collections to be accessed In a seamless way Still benefiting from specific collection commitments Keeping original metadata schemes and vocabularies Using Semantic Web means for Representation of the different points of view in one system Creation and use of the alignment knowledge 2 methodological concerns Generalize as much as possible Automate as much as possible
Experiment On a reduced scale 2 collections and associated vocabularies Desired output: insights on Use of off-the-shelf SW techniques with CH-specific resources Impact of turning to standard proposals (SW-linked tools and methods) In a context of natural semantics (thesauri) Added value of this effort Quantitative and qualitative evaluation Simple prototype for accessing documents
1st Collection: KB Illustrated Manuscripts
2nd Collection: Rijksmuseum ARIA collection
Experiment Steps
Steps
Gathering vocabulary and collection data Analyzing it Transforming it using SW standards All record/vocabulary information in one repository
SKOS Simple Knowledge Organization System Model to represent traditional vocabularies (thesauri, classification schemes) on the Semantic Web Classes and properties to create RDF/XML data Concepts and Concept schemes Lexical properties (prefLabel, altLabel) Semantic relations (broader, related) Notes (scopeNote, definition)
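As a rough illustration of the SKOS features listed above, the following Python sketch models one concept with its lexical properties, semantic relations and notes, and prints it as Turtle text. The class name, the example.org URIs and the sample terms are hypothetical and serve only as an illustration; this is not the project's actual conversion code.

```python
from dataclasses import dataclass, field

SKOS = "http://www.w3.org/2004/02/skos/core#"


@dataclass
class Concept:
    """A SKOS concept with the property groups named on the slide."""
    uri: str
    scheme: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    related: list = field(default_factory=list)
    scope_note: str = ""

    def to_turtle(self, lang="en"):
        """Serialise the concept as one Turtle statement block.

        Labels are assumed not to contain double quotes (no escaping here).
        """
        lines = [f"<{self.uri}> a skos:Concept",
                 f"    skos:inScheme <{self.scheme}>",
                 f'    skos:prefLabel "{self.pref_label}"@{lang}']
        lines += [f'    skos:altLabel "{a}"@{lang}' for a in self.alt_labels]
        lines += [f"    skos:broader <{b}>" for b in self.broader]
        lines += [f"    skos:related <{r}>" for r in self.related]
        if self.scope_note:
            lines.append(f'    skos:scopeNote "{self.scope_note}"@{lang}')
        return f"@prefix skos: <{SKOS}> .\n" + " ;\n".join(lines) + " ."


# Hypothetical concept, loosely styled after a museum vocabulary.
painting = Concept(
    uri="http://example.org/vocab/paintings",
    scheme="http://example.org/vocab",
    pref_label="paintings",
    alt_labels=["pictures"],
    broader=["http://example.org/vocab/artworks"],
)
print(painting.to_turtle())
```

Keeping the concept as a small data class makes it easy to load several vocabularies into one in-memory repository before serialising them.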
Vocabulary Formalisation: ARIA in SKOS
Steps
Provide mappers with vocabulary data Proceed to evaluation/selection of their results Put the alignment in the repository
Automatic Ontology Matching Techniques Generally aiming at recognizing equivalence or subsumption links between ontology elements Lexical Labels of entities, textual definitions Structural Structure of the formal definitions of entities, position in the hierarchy Statistical Objects, instantiation of the concepts Shared background knowledge (“oracles”) Using conceptual references to deduce correspondences Most mapping tools use a mix of such approaches E.g. lexical string matching can trigger a structural alignment process [Figure: example with the nodes "brain", "Long", "tumor", "Long"]
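The lexical technique above can be sketched in a few lines of Python. This is a minimal illustration, assuming each vocabulary is available as a mapping from concept id to its labels; the function names and the sample ids are hypothetical, and the mapping tools discussed in this talk combine such matching with other techniques.

```python
import unicodedata


def normalise(label):
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def lexical_matches(vocab_a, vocab_b):
    """Return candidate equivalences as (concept_a, concept_b, shared_label).

    Each vocabulary maps a concept id to the list of its labels
    (preferred and alternative labels alike).
    """
    # Index the second vocabulary by normalised label.
    index_b = {}
    for concept_b, labels in vocab_b.items():
        for label in labels:
            index_b.setdefault(normalise(label), set()).add(concept_b)

    matches = set()
    for concept_a, labels in vocab_a.items():
        for label in labels:
            key = normalise(label)
            for concept_b in index_b.get(key, ()):
                matches.add((concept_a, concept_b, key))
    return sorted(matches)


# Hypothetical sample data.
candidates = lexical_matches(
    {"a:1": ["Église", "church"]},
    {"b:7": ["eglise"], "b:9": ["castle"]},
)
print(candidates)
```

Each candidate could then seed a structural step, for example by comparing the neighbours of the two matched concepts.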
Collection Integration: Ontology Mapping Tools Tests with 2 mapping tools S-Match, Trento Tree-like structures mapper Falcon-AO, Nanjing Standard OWL ontology mapper Using Lexical comparisons Structural comparisons Third resource (Wordnet as ‘oracle’)
Mappings
Steps Adapted faceted browsing paradigm (Flamenco) Search by navigating through several dimensions Adaptation of the paradigm: From facets corresponding to orthogonal dimensions of object description (‘material’, ‘location’) to facets corresponding to different conceptual schemes (ARIA, IconClass) 3 views (sets of facet definitions) on integrated collections Single view Combined view Merged view
Collections Access: Single View Facets based on 1 concept scheme Access to objects indexed against concepts from other schemes, if a mapping exists between their index concepts and the selected concepts A single point of view on integrated data set
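A minimal sketch of the single-view lookup, assuming the alignment is stored as plain concept pairs and treated as symmetric. The function name, the object ids and the concept ids are hypothetical.

```python
def objects_for_concept(concept, index, mappings):
    """Objects reachable from `concept` in a single-view facet.

    index: object id -> set of concept ids used to index it
    mappings: set of (concept_x, concept_y) alignment pairs,
              treated here as symmetric links
    """
    targets = {concept}
    for left, right in mappings:
        if left == concept:
            targets.add(right)
        elif right == concept:
            targets.add(left)
    return sorted(obj for obj, concepts in index.items() if concepts & targets)


# Hypothetical sample data: two collections indexed with different schemes.
index = {
    "aria:obj1": {"aria:Landscapes"},
    "kb:ms12": {"ic:25H"},
    "kb:ms40": {"ic:11D"},
}
mappings = {("aria:Landscapes", "ic:25H")}

print(objects_for_concept("aria:Landscapes", index, mappings))
```

Selecting one facet value thus returns objects from both collections, although only one vocabulary is shown to the user.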
Collections Access: Combined View Search based on 2 concept schemes Facets attached to the different vocabularies are presented Simultaneous access from different points of view on the same data
Collections Access: Merged View Facets using a merged concept scheme with hierarchical links coming from schemes and alignment Making the links between vocabularies more visible during search A way to ‘enrich’ weakly structured vocabularies
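The merged view can be sketched as a union of hierarchical links, assuming that both the scheme hierarchies and the alignment are available as (narrower, broader) pairs. All names and ids below are hypothetical.

```python
from collections import defaultdict


def merge_hierarchies(*edge_sets):
    """Union several sets of (narrower, broader) links into one child index."""
    children = defaultdict(set)
    for edges in edge_sets:
        for narrower, broader in edges:
            children[broader].add(narrower)
    return children


def descendants(concept, children):
    """All concepts below `concept` in the merged hierarchy (cycle-safe)."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical data: a richly structured scheme and a flat one.
rich_links = {("rich:flowers", "rich:plants")}
alignment_links = {("flat:roses", "rich:flowers"), ("flat:tulips", "rich:flowers")}

merged = merge_hierarchies(rich_links, alignment_links)
print(sorted(descendants("rich:plants", merged)))
```

The flat vocabulary gains a hierarchy from the alignment links, which is one way to read the 'enrich' remark above.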
Collection Access: demo
Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain
KB books, collections and vocabularies
KB Vocabularies Brinkman Large (5200 terms) Weakly structured 1-level deep hierarchy GTT general subjects Huge (35000 terms) Very weakly structured : 0.5-level deep hierarchy [NBC] Large (2000 classes) Weakly but regularly structured (balanced 2-level deep classification) Common point: almost standard thesaurus information Associative relationships (RT) Synonyms/non-preferred terms Scope notes
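KB Vocabularies Example thesaurus entry Scope note: "het zoeken van patronen, regelmatigheden of zelfs kennis in databases. De inductie van begrijpelijke modellen en patronen uit databases" (searching for patterns, regularities or even knowledge in databases; the induction of comprehensible models and patterns from databases) Terms: databanken (databases), kunstmatige intelligentie (artificial intelligence), data mining, knowledge discovery in databases, KDD, ICT-zakboekje (ICT pocket book)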
KB Aim Integration of GTT and Brinkman One single subject vocabulary instead of two Requirement: keeping links to old indexing subjects Thesaurus refinement Focusing on KB scientific interest (humanities) Re-structuring thesaurus: more hierarchical links, top terms for GTT!
Manuscripts, 1st Collection: KB Illustrated Manuscripts
Manuscripts, 2nd Collection: BNF Mandragore
Manuscripts vocabularies Mandragore Huge (16000 terms) Weakly structured (2-level deep, multi-inheritance) Alternative lexical forms Definitions IconClass Huge (>24000 subjects) Richly structured : 10 level hierarchy, cross-references Compound concepts: keys, structural digits… Keywords
Manuscripts, Aim Integrated access All illuminations via Mandragore vocabulary All illuminations via Iconclass vocabulary
Iconclass Iconclass contains complex information Links between normal subjects and possible qualifiers Compound concepts Local extensions Existing representation is mostly text-based Aims Building a Semantic Web-enabled complete representation Dedicated ontology Conversion process implemented -> 1.2 M RDF triples Providing this representation as a (web) service As well as a standard SKOS version
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain
Steps Gathering vocabulary and collection data Analyzing it Transforming it using SW standards All record/vocabulary information in one repository
Conversion into RDF specific model and SKOS
BNF export tag | Interpretation | Conversion to Record-specific RDFS model | SKOS interpretation
Descriptor | | inScheme #descripteursScheme | inScheme #mandragoreScheme
libelle | label | md:hasPreferredLabel | prefLabel xml:lang="fr"
description | Definition note | md:hasDefinition | definition
formes-rejetees/forme-rejetee/libelle (+ optional /description) | Rejected label (and optional description) | md:hasRejectedForm which points at an anonymous [RejectedForm] resource with hasRejectedLabel and hasRejectedLabelDefinition with textual values respectively set to the content of libelle and description elements (using rdf:parseType=Resource) | altLabel xml:lang="fr" + definition
notes | Complementary definition | md:hasNote | note
codes-dewey/code-dewey | Thematic classification (given by DDC code which is attached to a classification element) | md:hasThematicClassification | broader
Transformation into SKOS Example values: "grégoire 11 pierre roger de beaufort cardinal diacre de sainte-marie-nouvelle, pape" Conversion of the main thesaurus features Preferred and alternative labels Semantic relationships (BT, RT) Notes (scope notes, definitions)
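The tag-to-SKOS correspondences above can be sketched as a small converter. This is only an illustration under stated assumptions: the element names come from the conversion table, but the XML nesting, the root element, the `id` attribute, the `#dewey-` identifiers and the roles assigned to the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names follow the conversion table,
# the nesting and the role of each sample value are assumptions.
RECORD = """
<descripteur id="d42">
  <libelle>grégoire 11</libelle>
  <description>cardinal diacre de sainte-marie-nouvelle, pape</description>
  <formes-rejetees>
    <forme-rejetee><libelle>pierre roger de beaufort</libelle></forme-rejetee>
  </formes-rejetees>
</descripteur>
"""

# Direct children of the record that map to one SKOS property each.
SIMPLE_TAGS = {
    "libelle": "skos:prefLabel",
    "description": "skos:definition",
    "notes": "skos:note",
}


def record_to_skos(xml_text, scheme="#mandragoreScheme"):
    """Return (subject, property, value) triples for one descriptor."""
    root = ET.fromstring(xml_text)
    subject = "#" + root.get("id", "unknown")
    triples = [(subject, "skos:inScheme", scheme)]
    for tag, prop in SIMPLE_TAGS.items():
        for element in root.findall(tag):
            if element.text and element.text.strip():
                triples.append((subject, prop, element.text.strip()))
    # Rejected forms become alternative labels (plus optional definitions).
    for rejected in root.findall("formes-rejetees/forme-rejetee"):
        label = rejected.findtext("libelle")
        if label:
            triples.append((subject, "skos:altLabel", label.strip()))
        description = rejected.findtext("description")
        if description:
            triples.append((subject, "skos:definition", description.strip()))
    # Dewey codes are read as broader links to classification concepts.
    for code in root.findall("codes-dewey/code-dewey"):
        if code.text:
            triples.append((subject, "skos:broader", "#dewey-" + code.text.strip()))
    return triples


for triple in record_to_skos(RECORD):
    print(triple)
```

Note that the rejected form loses its own node in this reading, which illustrates how some information disappears when converting to SKOS.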
Collection Formalization Problems Interpreting and representing vocabularies using formal standards is hindered by expressivity variation Complex models Non-standard features Fuzzy structures, weakly structured Some information is lost when converting to SKOS Qualifiers Compound concepts Relation between terms (not only concepts) We kept complete models Ad hoc ontologies, cf. Iconclass
Collection Formalization Problems System-specific conversions were done Depending on application environment Standard RDFS expressivity and implemented tools Depending on the mapping tools, which might make different hypotheses on the nature of knowledge to align OWL classes vs. nodes in trees
Steps Provide mappers with vocabulary data Proceed to evaluation/selection of their results Put the alignment in the repository
Lessons learned: Collection Integration We have ontology mappers, not thesaurus mappers Input: pre-processing to pure RDFS/OWL ontologies Mapping process Using resources that may be absent from CH vocabularies Rich formal/structural information Not (properly) using all information found in CH vocabularies E.g. rich lexical information Output: needs re-interpretation of mapping relations
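The re-interpretation of mapper output can be sketched as a lookup from mapper relations to SKOS mapping properties. The relation symbols, the confidence values and the tuple layout below are assumptions about what a mapper might emit, not the documented output format of S-Match or Falcon-AO.

```python
# Assumed relation symbols; "a < b" is read as "a is less general than b".
RELATION_TO_SKOS = {
    "=": "skos:exactMatch",
    "<": "skos:broadMatch",   # the target is broader than the source
    ">": "skos:narrowMatch",  # the target is narrower than the source
}


def reinterpret(mappings, min_confidence=0.5):
    """Split mapper output into usable SKOS links and skipped entries.

    mappings: iterable of (source, relation, target, confidence)
    """
    kept, skipped = [], []
    for source, relation, target, confidence in mappings:
        prop = RELATION_TO_SKOS.get(relation)
        if prop is None or confidence < min_confidence:
            skipped.append((source, relation, target, confidence))
        else:
            kept.append((source, prop, target))
    return kept, skipped


# Hypothetical mapper output.
raw = [
    ("a:1", "=", "b:1", 0.9),
    ("a:2", "<", "b:2", 0.7),
    ("a:3", "?", "b:3", 0.9),   # unknown relation
    ("a:4", ">", "b:4", 0.2),   # below the confidence threshold
]
kept, skipped = reinterpret(raw)
print(kept)
```

Keeping the skipped entries separate makes them available for manual review rather than silently dropping them.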
Alignment: lessons learned from previous experiments Lexical approaches Should use everything in a thesaurus Structural approaches Useless (even harmful) in a context where hierarchical information is weak Using background knowledge Needs to find proper resource/dictionary (Wordnet) Statistical approaches Needs dually classified data
Lexical alignment: Manuscripts case [Monolingual case, since IC comes in French] Basic label comparison Preferred labels Alternative labels Going beyond labels Labels and definitions IC keywords and Mandragore labels Lexical information as bags-of-words Words in IC (glossy labels) found in Mandragore labels, and vice versa Words in Mandragore definitions found in IC labels
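The bag-of-words comparison can be sketched as a containment score: the share of words of one label that also occur in another label or definition. The stopword list, the threshold, the function names and the sample texts are hypothetical, and no lemmatization is applied.

```python
import re

# Small, hypothetical French stopword list.
STOPWORDS = {"de", "la", "le", "les", "du", "des", "et", "en", "d", "l"}


def bag_of_words(text):
    """Lower-cased set of words, without stopwords."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def containment(source_text, target_text):
    """Share of source words that also occur in the target text."""
    source, target = bag_of_words(source_text), bag_of_words(target_text)
    return len(source & target) / len(source) if source else 0.0


def bag_matches(labels_a, texts_b, threshold=1.0):
    """Pairs (a, b, score) where enough words of label a occur in text b."""
    results = []
    for id_a, label in labels_a.items():
        for id_b, text in texts_b.items():
            score = containment(label, text)
            if score >= threshold:
                results.append((id_a, id_b, score))
    return results


# Hypothetical sample: a short label against a longer descriptive label.
pairs = bag_matches(
    {"mg:1": "saint pierre", "mg:2": "saint paul"},
    {"ic:x": "l'apôtre saint pierre, premier évêque de rome"},
)
print(pairs)
```

Lowering the threshold yields more, noisier candidates, which gives the kind of gradation between strict and loose matches mentioned for this case.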
Lexical alignment: Manuscripts case From 430 to found matches Some redundant Some comparisons bringing quite some noise Interesting is that we have a gradation Interesting coverage for the application Mandragore terms accessible from an IC term IC terms accessible from a MG term Fuzziness of original hierarchies allows for (associative) noise Problems: Better NLP treatments (e.g. lemmatization) Choice of proper alignment link depending on the features compared
Lexical alignment: Manuscripts case [Figure: mapping links labelled "broader" and "equivalent"]
Demo Corn
Statistical approach: KB case
Comparing documents indexed with Brinkman (BK) concepts and documents indexed with GTT concepts Overlap measure
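One possible overlap measure is the Jaccard coefficient over the sets of books indexed with each concept. The slide does not say which measure was used, so Jaccard is an assumption here, as are the function names and the sample data.

```python
from collections import defaultdict


def extension(indexing):
    """Invert book -> concepts into concept -> set of books."""
    books_by_concept = defaultdict(set)
    for book, concepts in indexing.items():
        for concept in concepts:
            books_by_concept[concept].add(book)
    return books_by_concept


def jaccard_overlaps(indexing_a, indexing_b, min_shared=1):
    """Score concept pairs by the overlap of the books they index."""
    ext_a, ext_b = extension(indexing_a), extension(indexing_b)
    scores = []
    for concept_a, books_a in ext_a.items():
        for concept_b, books_b in ext_b.items():
            shared = len(books_a & books_b)
            if shared >= min_shared:
                score = shared / len(books_a | books_b)
                scores.append((concept_a, concept_b, score))
    return sorted(scores, key=lambda item: -item[2])


# Hypothetical dually indexed books.
indexing_a = {
    "book1": {"bk:computers"},
    "book2": {"bk:computers"},
    "book3": {"bk:history"},
}
indexing_b = {
    "book1": {"gtt:data mining"},
    "book2": {"gtt:data mining", "gtt:databases"},
    "book3": {"gtt:middle ages"},
}
for row in jaccard_overlaps(indexing_a, indexing_b):
    print(row)
```

Raising `min_shared` is a crude guard against pairs whose score rests on only one or two books.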
Statistical approach: problems Finding a threshold to filter results Taking into account thesaurus use Levels of indexing are different Statistical significance is not guaranteed Overlap measure is less significant when concepts are used only a few times
Using background knowledge Interesting research Background knowledge (BK) brings additional structural semantics to concepts BK brings more lexical knowledge (synonyms) in the loop Problem: needs to find proper resource/dictionary Domain-specific vocabularies Language-specific vocabularies
First experiments on anchoring GTT to Wordnet [with Véronique Malaisé, CHOICE] Setting Using an online Dutch-English dictionary Comparing translations (and definitions) found with Wordnet content Nice feature: many GTT terms are already manually translated Results Poor recall: 9% of concepts for which there was a manual translation were anchored to WN Problems: Encoding Domain-specificity Complex terms Results are better with another vocabulary
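The anchoring setting can be sketched with toy data: translate each term with a dictionary, then look the translations up in an English lexicon. The dictionary entries, the lexicon and the synset-style identifiers below are made up for illustration; the real experiment used an online dictionary and Wordnet itself.

```python
def anchor(terms, dictionary, lexicon):
    """Map each source term to the lexicon entries matching a translation."""
    anchored = {}
    for term in terms:
        hits = set()
        for translation in dictionary.get(term, ()):
            hits |= lexicon.get(translation.lower(), set())
        if hits:
            anchored[term] = sorted(hits)
    return anchored


def anchoring_rate(anchored, terms):
    """Share of terms for which at least one anchor was found."""
    return len(anchored) / len(terms) if terms else 0.0


# Toy data (all hypothetical).
terms = ["kunstmatige intelligentie", "databanken", "ICT-zakboekje"]
dictionary = {
    "kunstmatige intelligentie": ["artificial intelligence"],
    "databanken": ["database"],
    "ICT-zakboekje": [],  # complex, domain-specific term: no translation
}
lexicon = {
    "artificial intelligence": {"artificial_intelligence.n.01"},
    "database": {"database.n.01"},
}

anchored = anchor(terms, dictionary, lexicon)
print(anchored, anchoring_rate(anchored, terms))
```

Terms without a usable translation simply drop out, which is where complex and domain-specific terms hurt recall.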
Other Dutch vocabularies hanging around
A transversal problem: evaluation Assessing quality of mapping In a specific context Taking into account Use of thesauri (indexing levels) Integration aim (hierarchical browsing) Designing evaluation tools Methods to evaluate samples to guide mapping process at a low cost And yet have statistical relevance
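Sample-based evaluation at low cost can be sketched as follows: draw a random sample of mappings, have an expert judge them, and report the precision with a confidence interval. The Wilson score interval is one standard choice and is an assumption here, as are the function names and the sample sizes.

```python
import math
import random


def sample_mappings(mappings, size, seed=0):
    """Draw a reproducible random sample of mappings for manual judgement."""
    rng = random.Random(seed)
    return rng.sample(mappings, min(size, len(mappings)))


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (centre - margin, centre + margin)


# Hypothetical run: 100 sampled mappings, 80 judged correct by an expert.
all_mappings = [("a:%d" % i, "b:%d" % i) for i in range(1000)]
sample = sample_mappings(all_mappings, 100)
low, high = wilson_interval(correct=80, total=len(sample))
print(f"precision 0.80, interval [{low:.3f}, {high:.3f}]")
```

The width of the interval shows directly how many judgements are needed before two mapping strategies can be told apart.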
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community
Pushing Research Results to the CH World Paper publications European Conference on Digital Libraries 2006 Informatie Professional 2006 Dissemination papers on SKOS and OWL Talks, demonstrations done and planned Digital Erfgoed Conference BNF KB RNA demo middag UDC seminar Lecture for Masters on Book & Digital Media (Leiden) SKOS CATCH day
Concrete Collaborations Collaborations with CH institutes Digitaal Erfgoed Nederland Creation of a thesaurus inventory questionnaire KB experts Illuminated Manuscripts Operational departments BNF Illuminated Manuscripts Rijksbureau Kunsthistorische Documentatie Iconclass [Illuminare (Leuven)]
Pushing Research Results to the CH World CH-oriented research projects The European Library Research proposal on multilingual thesaurus alignment CATCH Rijksmuseum collections (CHIP) Anchoring GTT to Wordnet (CHOICE) Metadata Recommendation (CHOICE, MITCH) Iconclass and GTAA Service (CHOICE) Mapping and vocabulary repository (CHOICE) RNA
Confronting existing SW tools with real CH data VU talks and collaborations External collaboration (Trento) Papers BNAIC SWI Prolog and the Web (Theory and Practice of Logic Programming) Participation in W3C Semantic Web Deployment working group Editor of SKOS use cases and requirements document Contribution of Manuscript and Iconclass use cases
Free discussion