Published by Henry Johnston. Modified over 9 years ago.
1
STITCH project CATCH User Group January 30th 2007
2
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain
3
Summary Global presentation of project and past work General motivations Pilot project Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain
4
Motivation Current CH trend: portals that build on heterogeneous collections Different databases Documents described/accessed according to different points of view (controlled vocabularies/MD schemes)
6
CH Interoperability Problems Current CH trend: portals that build on heterogeneous collections Different databases/vocabularies/MD schemes Syntactic interoperability problem is being solved Access can be granted, cf. deployed portals Semantic interoperability still to be addressed Links with original vocabularies/MD structures are lost
8
STITCH General Goals [SemanTic Interoperability To access Cultural Heritage] Allow heterogeneous CH collections to be accessed In a seamless way Still benefiting from specific collection commitments Keeping original metadata schemes and vocabularies
10
STITCH General Goals (2). Allow heterogeneous CH collections to be accessed in a seamless way, while still benefiting from specific collection commitments (keeping original metadata schemes and vocabularies). Using Semantic Web means for: representation of the different points of view in one system; creation and use of the alignment knowledge. Two methodological concerns: generalize as much as possible; automate as much as possible.
11
Summary Global presentation of project and past work General motivations Pilot project Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain
12
Experiment. On a reduced scale: 2 collections and associated vocabularies. Desired output, insights on: use of SW off-the-shelf techniques with CH-specific resources; impact of turning to standard proposals (SW-linked tools and methods) in a context of natural semantics (thesauri); added value of this effort (quantitative and qualitative evaluation). Simple prototype for accessing documents.
13
1st Collection: KB Illustrated Manuscripts
15
2nd Collection: Rijksmuseum ARIA collection
17
Experiment Steps
18
Steps
19
Gathering vocabulary and collection data Analyzing it Transforming it using SW standards All record/vocabulary information in one repository
20
SKOS (Simple Knowledge Organization System). Model to represent traditional vocabularies (thesauri, classification schemes) on the Semantic Web. Classes and properties to create RDF/XML data: concepts and concept schemes; lexical properties (prefLabel, altLabel); semantic relations (broader, related); notes (scopeNote, definition).
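The SKOS features listed on this slide can be illustrated with a minimal Python sketch that writes one concept as Turtle text. The `Concept` class, the `to_turtle` helper and the example.org URIs are assumptions made for illustration and are not taken from the project; only the SKOS namespace and property names are standard. Literal escaping is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List

# Standard SKOS namespace declaration, to be placed once at the top of a Turtle file.
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


@dataclass
class Concept:
    """Hypothetical in-memory stand-in for a skos:Concept."""
    uri: str
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)
    broader: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)
    scope_note: str = ""


def to_turtle(concept: Concept, scheme_uri: str, lang: str = "en") -> str:
    """Serialise one concept as a Turtle statement (no escaping of quotes)."""
    pairs = [
        ("a", "skos:Concept"),
        ("skos:inScheme", f"<{scheme_uri}>"),
        ("skos:prefLabel", f'"{concept.pref_label}"@{lang}'),
    ]
    pairs += [("skos:altLabel", f'"{alt}"@{lang}') for alt in concept.alt_labels]
    pairs += [("skos:broader", f"<{uri}>") for uri in concept.broader]
    pairs += [("skos:related", f"<{uri}>") for uri in concept.related]
    if concept.scope_note:
        pairs.append(("skos:scopeNote", f'"{concept.scope_note}"@{lang}'))
    body = " ;\n    ".join(f"{pred} {obj}" for pred, obj in pairs)
    return f"<{concept.uri}>\n    {body} ."


# Toy concept with made-up URIs.
painting = Concept(
    uri="http://example.org/vocab/painting",
    pref_label="painting",
    alt_labels=["picture"],
    broader=["http://example.org/vocab/artwork"],
)
print(SKOS_PREFIX)
print(to_turtle(painting, "http://example.org/vocab"))
```

A real conversion would more likely rely on an RDF library; plain strings are used here only to keep the sketch self-contained.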
21
Vocabulary Formalisation: ARIA in SKOS
22
Steps
23
Provide mappers with vocabulary data Proceed to evaluation/selection of their results Put the alignment in the repository
25
Automatic Ontology Matching Techniques. Generally aiming at recognizing equivalence or subsumption links between ontology elements. Lexical: labels of entities, textual definitions. Structural: structure of the formal definitions of entities, position in the hierarchy. Statistical: objects, instantiation of the concepts. Shared background knowledge (“oracles”): using conceptual references to deduce correspondences. Most mapping tools use a mix of such approaches, e.g. lexical string matching can trigger a structural alignment process. [Figure labels: brain, tumor, Long]
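As a rough illustration of how a lexical match can trigger a structural step, the sketch below first finds equivalence candidates by comparing normalised labels, then derives subsumption candidates for the children of each matched concept. All function names and the toy vocabularies are assumptions; this is not how S-Match or Falcon-AO are implemented.

```python
def normalise(label):
    """Lower-case a label and collapse whitespace."""
    return " ".join(label.lower().split())


def lexical_seeds(labels_a, labels_b):
    """Return (a, b) pairs whose concepts share at least one normalised label.

    labels_a and labels_b map a concept id to a list of labels.
    """
    index = {}
    for b, labels in labels_b.items():
        for label in labels:
            index.setdefault(normalise(label), set()).add(b)
    return {
        (a, b)
        for a, labels in labels_a.items()
        for label in labels
        for b in index.get(normalise(label), ())
    }


def propagate_subsumption(seeds, children_a):
    """Structural step: if a is equivalent to b, each child of a is narrower than b.

    children_a maps a concept id of vocabulary A to its direct children.
    """
    return {(child, b) for a, b in seeds for child in children_a.get(a, [])}


# Toy data, invented for the example.
labels_a = {"a:brain": ["Brain"], "a:brain-tumor": ["Brain tumor"]}
labels_b = {"b:brain": ["brain", "encephalon"]}
children_a = {"a:brain": ["a:brain-tumor"]}

seeds = lexical_seeds(labels_a, labels_b)
print(seeds)
print(propagate_subsumption(seeds, children_a))
```

The structural step is only as reliable as the hierarchy it reads from, which matters for weakly structured thesauri.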
26
Collection Integration: Ontology Mapping Tools Tests with 2 mapping tools S-Match, Trento Tree-like structures mapper Falcon-AO, Nanjing Standard OWL ontology mapper Using Lexical comparisons Structural comparisons Third resource (Wordnet as ‘oracle’)
27
Mappings
29
Steps Adapted faceted browsing paradigm (Flamenco) Search by navigating through several dimensions Adaptation of the paradigm: From facets corresponding to orthogonal dimensions of object description (‘material’, ‘location’) to facets corresponding to different conceptual schemes (ARIA, IconClass) 3 views (sets of facet definitions) on integrated collections Single view Combined view Merged view
30
Collections Access: Single View Facets based on 1 concept scheme Access to objects indexed against concepts from other schemes If mapping between their index and the selected concepts A single point of view on integrated data set
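The single view can be sketched as a lookup that follows mapping links: selecting a concept returns the objects indexed with it plus the objects indexed with concepts mapped to it. The data layout (plain dictionaries of sets) and all identifiers are assumptions made for this example, not the prototype's actual design.

```python
def objects_for_concept(concept, index, mapping):
    """Return ids of objects reachable from a selected concept.

    index maps any concept id (from either scheme) to the set of object ids
    indexed with it; mapping maps a concept of the browsing scheme to the set
    of concepts it is aligned with in other schemes.
    """
    targets = {concept} | mapping.get(concept, set())
    found = set()
    for target in targets:
        found |= index.get(target, set())
    return found


# Toy index and alignment with made-up identifiers.
index = {
    "aria:c1": {"rijksmuseum:obj1"},
    "ic:c7": {"kb:ms12", "kb:ms13"},
}
mapping = {"aria:c1": {"ic:c7"}}

print(objects_for_concept("aria:c1", index, mapping))
```

Narrower concepts are ignored here; a browsing interface would normally also collect the objects of all descendants of the selected concept.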
31
Collections Access: Combined View Search based on 2 concepts schemes Facets attached to the different vocabularies are presented Simultaneous access from different points of view on the same data
32
Collections Access: Merged View Facets using a merged concept scheme with hierarchical links coming from schemes and alignment Making the links between vocabularies more visible during search A way to ‘enrich’ weakly structured vocabularies
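One possible reading of the merged view is a single "broader" relation obtained by taking the union of the hierarchical links of each scheme and those of the alignment. The sketch below assumes that reading; the function names and the toy identifiers are hypothetical.

```python
def merge_hierarchies(*broader_maps):
    """Union several concept -> set-of-broader-concepts maps into one."""
    merged = {}
    for broader in broader_maps:
        for concept, parents in broader.items():
            merged.setdefault(concept, set()).update(parents)
    return merged


def ancestors(concept, broader):
    """Collect all concepts reachable through broader links (cycle-safe)."""
    seen = set()
    stack = list(broader.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


# Toy data: scheme A has a hierarchy, scheme B is flat,
# and the alignment says b:poodle is narrower than a:dog.
broader_a = {"a:dog": {"a:animal"}}
broader_b = {}
alignment_broader = {"b:poodle": {"a:dog"}}

merged = merge_hierarchies(broader_a, broader_b, alignment_broader)
print(ancestors("b:poodle", merged))
```

In this toy case the flat vocabulary B gains a hierarchy through the alignment, which is one way a weakly structured vocabulary could be enriched.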
33
Collection Access: demo http://stitch.cs.vu.nl/demo
34
Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain
35
KB books, collections and vocabularies
36
KB Vocabularies. Brinkman: large (5200 terms); weakly structured, 1-level deep hierarchy. GTT general subjects: huge (35000 terms); very weakly structured, 0.5-level deep hierarchy. [NBC]: large (2000 classes); weakly but regularly structured (balanced 2-level deep classification). Common point: almost standard thesaurus information: associative relationships (RT); synonyms/non-preferred terms; scope notes.
37
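KB Vocabularies (example record)
203379888 Thv zzz
"het zoeken van patronen, regelmatigheden of zelfs kennis in databases. De inductie van begrijpelijke modellen en patronen uit databases" (English: "the search for patterns, regularities or even knowledge in databases. The induction of understandable models and patterns from databases")
075603705 databanken (databases)
075652528 kunstmatige intelligentie (artificial intelligence)
data mining
knowledge discovery in databases
KDD
ICT-zakboekje. - 1999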
38
KB Aim. Integration of GTT and Brinkman: one single subject vocabulary instead of two; requirement: keeping links to old indexing subjects. Thesaurus refinement: focusing on KB scientific interest (humanities); re-structuring the thesaurus with more hierarchical links (19769 top terms for GTT!).
39
Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain
40
Manuscripts, 1st Collection: KB Illustrated Manuscripts
42
Manuscripts, 2nd Collection: BNF Mandragore
44
Manuscripts vocabularies. Mandragore: huge (16000 terms); weakly structured (2-level deep, multi-inheritance); alternative lexical forms; definitions. IconClass: huge (>24000 subjects); richly structured (10-level hierarchy, cross-references); compound concepts (keys, structural digits…); keywords.
46
Manuscripts, Aim Integrated access All illuminations via Mandragore vocabulary All illuminations via Iconclass vocabulary
47
Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain
48
Iconclass Iconclass contains complex information Links between normal subjects and possible qualifiers Compound concepts Local extensions Existing representation is mostly text-based Aims Building a Semantic Web-enabled complete representation Dedicated ontology Conversion process implemented -> 1.2 M RDF triples Providing this representation as a (web) service As well as a standard SKOS version
49
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain
50
Steps Gathering vocabulary and collection data Analyzing it Transforming it using SW standards All record/vocabulary information in one repository
51
Conversion into RDF specific model and SKOS
BNF export tag | Interpretation | Conversion to record-specific RDFS model | SKOS interpretation
- | Descriptor | about=namespace+'d_'+@id | Concept
@id | id | md:descriptor-id | -
- | - | inScheme #descripteursScheme | inScheme #mandragoreScheme
libelle | label | md:hasPreferredLabel | prefLabel xml:lang="fr"
description | Definition note | md:hasDefinition | definition
formes-rejetees/forme-rejetee/libelle (+ optional /description) | Rejected label (and optional description) | md:hasRejectedForm, which points at an anonymous [RejectedForm] resource with hasRejectedLabel and hasRejectedLabelDefinition, whose textual values are set to the content of the libelle and description elements respectively (using rdf:parseType=Resource) | altLabel xml:lang="fr" + definition
notes | Complementary definition | md:hasNote | note
codes-dewey/code-dewey | Thematic classification (given by the DDC code attached to a classification element) | md:hasThematicClassification | broader
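The SKOS column of this mapping can be sketched in Python with the standard `xml.etree.ElementTree` module. The child element names (`libelle`, `description`, `formes-rejetees/forme-rejetee`, `notes`, `codes-dewey/code-dewey`) and the `d_` URI prefix come from the slide. The root element name `descripteur`, the sample record, the namespace and the `ddc_` prefix are assumptions made for illustration. The optional description of a rejected form is not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment; the real BNF export layout may differ.
SAMPLE = """
<descripteur id="42">
  <libelle>grégoire 11</libelle>
  <description>cardinal diacre de sainte-marie-nouvelle, pape</description>
  <formes-rejetees>
    <forme-rejetee><libelle>pierre roger de beaufort</libelle></forme-rejetee>
  </formes-rejetees>
</descripteur>
"""


def descriptor_to_skos(elem, namespace):
    """Turn one descriptor element into (subject, predicate, object) tuples.

    namespace is assumed to end with '#'. Literals are (text, language) pairs.
    """
    subject = f"{namespace}d_{elem.get('id')}"
    triples = [
        (subject, "rdf:type", "skos:Concept"),
        (subject, "skos:inScheme", f"{namespace}mandragoreScheme"),
    ]
    label = elem.findtext("libelle")
    if label:
        triples.append((subject, "skos:prefLabel", (label.strip(), "fr")))
    definition = elem.findtext("description")
    if definition:
        triples.append((subject, "skos:definition", (definition.strip(), "fr")))
    for rejected in elem.findall("formes-rejetees/forme-rejetee"):
        alt = rejected.findtext("libelle")
        if alt:
            triples.append((subject, "skos:altLabel", (alt.strip(), "fr")))
    note = elem.findtext("notes")
    if note:
        triples.append((subject, "skos:note", (note.strip(), "fr")))
    for code in elem.findall("codes-dewey/code-dewey"):
        if code.text:
            # 'ddc_' is a made-up prefix for classification URIs.
            triples.append((subject, "skos:broader", f"{namespace}ddc_{code.text.strip()}"))
    return triples


root = ET.fromstring(SAMPLE)
for triple in descriptor_to_skos(root, "http://example.org/mandragore#"):
    print(triple)
```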
52
Transformation into SKOS. Example: grégoire 11; pierre roger de beaufort; cardinal diacre de sainte-marie-nouvelle, pape (English: Gregory XI; Pierre Roger de Beaufort; cardinal deacon of Santa Maria Nuova, pope). Conversion of the main thesaurus features: preferred and alternative labels; semantic relationships (BT, RT); notes (scope notes, definitions).
53
Collection Formalization Problems. Interpreting and representing vocabularies using formal standards is hindered by variation in expressivity: complex models; non-standard features; fuzzy or weak structures. Some information is lost when converting to SKOS: qualifiers; compound concepts; relations between terms (not only concepts). We kept complete models: ad hoc ontologies, cf. Iconclass.
54
Collection Formalization Problems System-specific conversions were done Depending on application environment Standard RDFS expressivity and implemented tools Depending on the mapping tools, which might make different hypotheses on the nature of knowledge to align OWL classes vs. nodes in trees
55
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain
56
Steps Provide mappers with vocabulary data Proceed to evaluation/selection of their results Put the alignment in the repository
57
Lessons learned: Collection Integration We have ontology mappers, not thesaurus mappers Input: pre-processing to pure RDFS/OWL ontologies Mapping process Using resources that may be absent from CH vocabularies Rich formal/structural information Not (properly) using all information found in CH vocabularies E.g. rich lexical information Output: needs re-interpretation of mapping relations
58
Automatic Ontology Matching Techniques. Generally aiming at recognizing equivalence or subsumption links between ontology elements. Lexical: labels of entities, textual definitions. Structural: structure of the formal definitions of entities, position in the hierarchy. Statistical: objects, instantiation of the concepts. Shared background knowledge (“oracles”): using conceptual references to deduce correspondences. Most mapping tools use a mix of such approaches, e.g. lexical string matching can trigger a structural alignment process. [Figure labels: brain, tumor, Long]
59
Alignment: lessons learned from previous experiments Lexical approaches Should use everything in a thesaurus Structural approaches Useless (even harmful) in a context where hierarchical information is weak Using background knowledge Needs to find proper resource/dictionary (Wordnet) Statistical approaches Needs dually classified data
60
Alignment: here Lexical approaches Should use everything in a thesaurus Structural approaches Useless (even harmful) in a context where hierarchical information is weak Using background knowledge Needs to find proper resource/dictionary (Wordnet) Statistical approaches Needs dually classified data
61
Lexical alignment: Manuscripts case [Monolingual case, since IC comes in French]. Basic label comparison: preferred labels; alternative labels. Going beyond labels: labels and definitions; IC keywords and Mandragore labels. Lexical information as bags-of-words: words in IC (glossy labels) found in Mandragore labels, and vice versa; words in Mandragore definitions found in IC labels.
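The bag-of-words comparison can be sketched as a containment score: the share of words of one text that also occur in the other. The scoring function, the threshold and the toy labels are assumptions for illustration; the slide does not say how the project scored or filtered such matches. No lemmatization or stop-word removal is applied.

```python
import re


def bag(text):
    """Split a text into a set of lower-cased words."""
    return set(re.findall(r"\w+", text.lower()))


def containment(source_text, target_text):
    """Fraction of the source words that also appear in the target text."""
    source, target = bag(source_text), bag(target_text)
    return len(source & target) / len(source) if source else 0.0


def match_by_containment(source_labels, target_labels, threshold=1.0):
    """Return (source_id, target_id, score) for pairs scoring at least threshold."""
    results = []
    for s_id, s_text in source_labels.items():
        for t_id, t_text in target_labels.items():
            score = containment(s_text, t_text)
            if score >= threshold:
                results.append((s_id, t_id, score))
    return results


# Toy labels with made-up identifiers.
ic_labels = {"ic:x": "saint pierre"}
mg_labels = {"mg:1": "Pierre (saint)", "mg:2": "pierre"}

print(match_by_containment(ic_labels, mg_labels))
print(match_by_containment(ic_labels, mg_labels, threshold=0.5))
```

Because the score is asymmetric, running it in both directions gives the "and vice versa" comparison, and lowering the threshold yields weaker matches.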
62
Lexical alignment: Manuscripts case. From 430 to 21000 matches found, depending on the comparison: some are redundant, and some comparisons bring considerable noise. Interestingly, the results show a gradation. Interesting coverage for the application: 12300 Mandragore terms accessible from an IC term; 22800 IC terms accessible from a MG term. Fuzziness of the original hierarchies allows for (associative) noise. Open problems: better NLP treatments (e.g. lemmatization); choice of the proper alignment link depending on the features compared.
63
Lexical alignment: Manuscripts case [Figure labels: broader, Equivalent]
64
Demo Corn
68
Statistical approach: KB case
69
Comparing documents indexed with Brinkman (BK) concepts and documents indexed with GTT concepts. Overlap measure.
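The slide does not say which overlap measure was used. As one common choice, the sketch below uses the Jaccard coefficient between the sets of documents indexed with each concept, with a score threshold and a minimum usage count as filters. "BK" is read here as Brinkman, which is an assumption, and all identifiers and parameter values are invented.

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap of two sets of document ids."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def candidate_mappings(index_a, index_b, threshold=0.5, min_docs=1):
    """Return (concept_a, concept_b, score) triples, best score first.

    index_a and index_b map a concept id to the set of documents indexed
    with it. Concepts used on fewer than min_docs documents are skipped,
    since their overlap scores carry little evidence.
    """
    results = []
    for a, docs_a in index_a.items():
        if len(docs_a) < min_docs:
            continue
        for b, docs_b in index_b.items():
            if len(docs_b) < min_docs:
                continue
            score = jaccard(docs_a, docs_b)
            if score >= threshold:
                results.append((a, b, score))
    return sorted(results, key=lambda item: -item[2])


# Toy dually indexed collection: document ids are plain integers.
brinkman_index = {"bk:databases": {1, 2, 3}}
gtt_index = {"gtt:databanken": {2, 3, 4}, "gtt:kunst": {9}}

print(candidate_mappings(brinkman_index, gtt_index))
```

Both filters are tuning knobs: the threshold trades recall for precision, and `min_docs` guards against concepts that were used only a few times.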
70
Statistical approach: problems. Finding a threshold to filter results. Taking thesaurus use into account: levels of indexing are different. Statistical significance is not guaranteed: the overlap measure is less significant when concepts are used only a few times.
71
Using background knowledge. Interesting research: background knowledge brings additional structural semantics to concepts; background knowledge brings more lexical knowledge (synonyms) into the loop. Problem: needs to find a proper resource/dictionary: domain-specific vocabularies; language-specific vocabularies.
72
First experiments on anchoring GTT to Wordnet [with Véronique Malaisé, CHOICE]. Setting: using an online Dutch-English dictionary; comparing the translations (and definitions) found with Wordnet content. Nice feature: many GTT terms are already manually translated. Results: poor recall, 9% of the concepts for which there was a manual translation were anchored to Wordnet. Problems: encoding; domain-specificity; complex terms. Results are better with another vocabulary.
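The anchoring setting can be sketched as a lookup of dictionary translations in a lexical resource, followed by a recall figure. Everything below is a toy stand-in: the translation table replaces the online dictionary, the lemma set replaces Wordnet, and the function names are hypothetical. The comparison of definitions mentioned on the slide is not covered.

```python
def anchor(term, translations, lemmas):
    """Return the translations of a term that exist in the lexical resource."""
    return {t for t in translations.get(term, []) if t.lower() in lemmas}


def anchoring_recall(terms, translations, lemmas):
    """Share of terms for which at least one translation could be anchored."""
    if not terms:
        return 0.0
    anchored = sum(1 for term in terms if anchor(term, translations, lemmas))
    return anchored / len(terms)


# Toy stand-ins for the dictionary and for the set of Wordnet lemmas.
translations = {
    "databanken": ["databases", "data banks"],
    "kunstmatige intelligentie": ["artificial intelligence"],
}
lemmas = {"artificial intelligence"}

print(anchor("kunstmatige intelligentie", translations, lemmas))
print(anchoring_recall(list(translations), translations, lemmas))
```

The toy case also shows one failure mode: "databases" is not found because only exact forms are looked up, so plural or complex terms need extra normalisation.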
73
Other Dutch vocabularies hanging around
74
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain
75
A transversal problem: evaluation Assessing quality of mapping In a specific context Taking into account Use of thesauri (indexing levels) Integration aim (hierarchical browsing) Designing evaluation tools Methods to evaluate samples to guide mapping process at a low cost And yet have statistical relevance
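Evaluating a sample at low cost while keeping some statistical relevance can be sketched as follows: draw a random sample of mappings, have an expert judge them, and report the observed precision with a confidence interval. The Wilson score interval used here is one standard option and is an assumption, as are the function names and numbers; the slide does not name a method.

```python
import math
import random


def sample_mappings(mappings, k, seed=0):
    """Draw up to k mappings at random from a list, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(mappings, min(k, len(mappings)))


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy scenario: 100 sampled mappings, 80 judged correct by an expert.
all_mappings = [("a:%d" % i, "b:%d" % i) for i in range(1000)]
sample = sample_mappings(all_mappings, 100)
low, high = wilson_interval(correct=80, n=len(sample))
print(f"precision 0.80, interval [{low:.3f}, {high:.3f}]")
```

A wider interval signals that more judgements are needed before two mapping methods can be ranked against each other.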
76
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community
77
Pushing Research Results to the CH World Paper publications European Conference on Digital Libraries 2006 Informatie Professional 2006 Dissemination papers on SKOS and OWL Talks, demonstrations done and planned Digital Erfgoed Conference BNF KB RNA demo middag UDC seminar Lecture for Masters on Book & Digital Media (Leiden) SKOS tutorial @ CATCH day
78
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community
79
Concrete Collaborations Collaborations with CH institutes Digitaal Erfgoed Nederland Creation of a thesaurus inventory questionnaire KB experts Illuminated Manuscripts Operational departments BNF Illuminated Manuscripts Rijksbureau Kunsthistorische Documentatie Iconclass [Illuminare (Leuven)]
80
Pushing Research Results to the CH World CH-oriented research projects The European Library Research proposal on multilingual thesaurus alignment CATCH Rijksmuseum collections (CHIP) Anchoring GTT to Wordnet (CHOICE) Metadata Recommendation (CHOICE, MITCH) Iconclass and GTAA Service (CHOICE) Mapping and vocabulary repository (CHOICE) RNA
81
Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community
82
Confronting existing SW tools with real CH data. VU talks and collaborations. External collaboration (Trento). Papers: BNAIC; "SWI-Prolog and the Web" (Theory and Practice of Logic Programming). Participation in the W3C Semantic Web Deployment working group: editor of the SKOS use cases and requirements document; contribution of the Manuscript and Iconclass use cases.
83
Free discussion