STITCH project CATCH User Group January 30th 2007.

1 STITCH project CATCH User Group January 30th 2007

2 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain

3 Summary Global presentation of project and past work General motivations Pilot project Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain

4 Motivation Current CH trend: portals that build on heterogeneous collections Different databases Documents described/accessed according to different points of view (controlled vocabularies/MD schemes)

5

6 CH Interoperability Problems Current CH trend: portals that build on heterogeneous collections Different databases/vocabularies/MD schemes Syntactic interoperability problem is being solved Access can be granted, cf. deployed portals Semantic interoperability still to be addressed Links with original vocabularies/MD structures are lost

7

8 STITCH General Goals [SemanTic Interoperability To access Cultural Heritage] Allow heterogeneous CH collections to be accessed In a seamless way Still benefiting from specific collection commitments Keeping original metadata schemes and vocabularies

9

10 STITCH General Goals (2) Allow heterogeneous CH collections to be accessed In a seamless way Still benefiting from specific collection commitments Keeping original metadata schemes and vocabularies Using Semantic Web means for Representation of the different points of view in one system Creation and use of the alignment knowledge 2 methodological concerns Generalize as much as possible Automate as much as possible

11 Summary Global presentation of project and past work General motivations Pilot project Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain

12 Experiment On a reduced scale 2 collections and associated vocabularies Desired output: insights on Use of SW off-the-shelf techniques with CH-specific resources Impact of turning to standard proposals (SW-linked tools and methods) In a context of natural semantics (thesauri) Added value of this effort Quantitative and qualitative evaluation Simple prototype for accessing documents

13 1st Collection: KB Illustrated Manuscripts

14

15 2nd Collection: Rijksmuseum ARIA collection

16

17 Experiment Steps

18 Steps

19 Gathering vocabulary and collection data Analyzing it Transforming it using SW standards All record/vocabulary information in one repository

20 SKOS Simple Knowledge Organisation Systems Model to represent traditional vocabularies (thesauri, classification schemes) on the Semantic Web Classes and properties to create XML/RDF data Concepts and Concept schemes Lexical properties (prefLabel, altLabel) Semantic relations (broader, related) Notes (scopeNote, definition)
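To make the SKOS building blocks listed on this slide concrete, here is a minimal Python sketch that represents one concept and expands it into subject/predicate/object triples. It is an illustration only, not the project's conversion code: the `Concept` class, the `to_triples` helper, the `http://example.org/vocab/` base URI and the sample concepts are all hypothetical, and prefixed names such as `skos:prefLabel` stand for terms in the SKOS namespace.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A thesaurus concept with the SKOS features named on the slide."""
    uri: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)   # URIs of broader concepts
    related: list = field(default_factory=list)   # URIs of related concepts
    scope_note: str = ""


def to_triples(concept, scheme_uri, lang="en"):
    """Expand a concept into (subject, predicate, object) triples.

    Literal objects are (text, language) tuples; resource objects are URIs.
    """
    s = concept.uri
    triples = [
        (s, "rdf:type", "skos:Concept"),
        (s, "skos:inScheme", scheme_uri),
        (s, "skos:prefLabel", (concept.pref_label, lang)),
    ]
    triples += [(s, "skos:altLabel", (a, lang)) for a in concept.alt_labels]
    triples += [(s, "skos:broader", b) for b in concept.broader]
    triples += [(s, "skos:related", r) for r in concept.related]
    if concept.scope_note:
        triples.append((s, "skos:scopeNote", (concept.scope_note, lang)))
    return triples


# Hypothetical sample data, not taken from ARIA or any real vocabulary.
BASE = "http://example.org/vocab/"
animals = Concept(uri=BASE + "animals", pref_label="animals")
dogs = Concept(
    uri=BASE + "dogs",
    pref_label="dogs",
    alt_labels=["hounds"],
    broader=[animals.uri],
    scope_note="Domestic dogs only.",
)
triples = to_triples(dogs, BASE + "scheme")
for triple in triples:
    print(triple)
```

In practice an RDF library would handle serialization to RDF/XML; the tuples here only show which statements a SKOS description of a thesaurus term amounts to.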

21 Vocabulary Formalisation: ARIA in SKOS

22 Steps

23 Provide mappers with vocabulary data Proceed to evaluation/selection of their results Put the alignment in the repository

24

25 Automatic Ontology Matching Techniques Generally aiming at recognizing equivalence or subsumption links between ontology elements Lexical Labels of entities, textual definitions Structural Structure of the formal definitions of entities, position in the hierarchy Statistical Objects, instantiation of the concepts Shared background knowledge (“oracles”) Using conceptual references to deduce correspondences Most mapping tools use a mix of such approaches E.g. lexical string matching can trigger a structural alignment process
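The last point, a lexical match setting off a structural step, can be sketched as follows. This is a toy illustration under stated assumptions, not the algorithm of any particular mapping tool: vocabularies are plain dictionaries, a lexical "seed" is any pair of concepts sharing a normalized label, and the structural step simply proposes the parents of each seed pair as a new candidate. All identifiers and labels are hypothetical.

```python
def normalize(label):
    """Lower-case a label and collapse whitespace."""
    return " ".join(label.lower().split())


def lexical_seeds(labels_a, labels_b):
    """Return pairs (a, b) of concepts that share at least one normalized label.

    labels_a / labels_b map a concept id to its list of labels.
    """
    index = {}
    for b, labels in labels_b.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(b)
    seeds = set()
    for a, labels in labels_a.items():
        for label in labels:
            for b in index.get(normalize(label), ()):
                seeds.add((a, b))
    return seeds


def structural_candidates(seeds, parent_a, parent_b):
    """Propose the parents of each seed pair as a candidate correspondence.

    parent_a / parent_b map a concept id to the id of its broader concept.
    """
    candidates = set()
    for a, b in seeds:
        pa, pb = parent_a.get(a), parent_b.get(b)
        if pa is not None and pb is not None and (pa, pb) not in seeds:
            candidates.add((pa, pb))
    return candidates


# Hypothetical toy vocabularies.
labels_a = {"a1": ["Animals"], "a2": ["Dog", "Hound"]}
labels_b = {"b1": ["Fauna"], "b2": ["dog"]}
parent_a = {"a2": "a1"}
parent_b = {"b2": "b1"}

seeds = lexical_seeds(labels_a, labels_b)
candidates = structural_candidates(seeds, parent_a, parent_b)
print(seeds)       # pairs found by label comparison
print(candidates)  # pairs suggested by hierarchy position
```

Here "Animals" and "Fauna" share no label, yet they become a candidate because their children match lexically. The same mechanism explains why structural propagation is fragile when the hierarchy is shallow or irregular.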

26 Collection Integration: Ontology Mapping Tools Tests with 2 mapping tools S-Match, Trento Tree-like structures mapper Falcon-AO, Nanjing Standard OWL ontology mapper Using Lexical comparisons Structural comparisons Third resource (Wordnet as ‘oracle’)

27 Mappings

28

29 Steps Adapted faceted browsing paradigm (Flamenco) Search by navigating through several dimensions Adaptation of the paradigm: From facets corresponding to orthogonal dimensions of object description (‘material’, ‘location’) to facets corresponding to different conceptual schemes (ARIA, IconClass) 3 views (sets of facet definitions) on integrated collections Single view Combined view Merged view

30 Collections Access: Single View Facets based on 1 concept scheme Access to objects indexed against concepts from other schemes If mapping between their index and the selected concepts A single point of view on integrated data set
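A minimal sketch of the lookup behind the single view, assuming the alignment is stored as a dictionary from a concept to the concepts it is mapped to in other schemes. The function name, the data layout and all identifiers are hypothetical placeholders, not the prototype's actual data model.

```python
def objects_for_concept(concept, index, mappings):
    """Return objects indexed with `concept` or with any concept mapped to it.

    index    maps a concept id to the set of object ids indexed with it.
    mappings maps a concept id to the set of concept ids aligned with it.
    """
    targets = {concept} | mappings.get(concept, set())
    found = set()
    for target in targets:
        found |= index.get(target, set())
    return found


# Hypothetical identifiers for concepts and objects.
index = {
    "aria:landscape": {"rm-1"},
    "ic:landscapes": {"kb-7", "kb-9"},
}
mappings = {"aria:landscape": {"ic:landscapes"}}

print(objects_for_concept("aria:landscape", index, mappings))
```

Selecting the ARIA facet value returns the Rijksmuseum object together with the two manuscripts indexed with the mapped Iconclass concept. The mapping table in this sketch is one-directional, so the reverse lookup returns only the manuscripts unless the inverse mapping is stored too.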

31 Collections Access: Combined View Search based on 2 concepts schemes Facets attached to the different vocabularies are presented Simultaneous access from different points of view on the same data

32 Collections Access: Merged View Facets using a merged concept scheme with hierarchical links coming from schemes and alignment Making the links between vocabularies more visible during search A way to ‘enrich’ weakly structured vocabularies

33 Collection Access: demo http://stitch.cs.vu.nl/demo

34 Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain

35 KB books, collections and vocabularies

36 KB Vocabularies Brinkman Large (5200 terms) Weakly structured 1-level deep hierarchy GTT general subjects Huge (35000 terms) Very weakly structured : 0.5-level deep hierarchy [NBC] Large (2000 classes) Weakly but regularly structured (balanced 2-level deep classification) Common point: almost standard thesaurus information Associative relationships (RT) Synonyms/non-preferred terms Scope notes

37 KB Vocabularies 203379888 Thv zzz "het zoeken van patronen, regelmatigheden of zelfs kennis in databases. De inductie van begrijpelijke modellen en patronen uit databases" 075603705 databanken 075652528 kunstmatige intelligentie data mining knowledge discovery in databases KDD ICT-zakboekje. - 1999

38 KB Aim Integration of GTT and Brinkman One single subject vocabulary instead of two Requirement: keeping links to old indexing subjects Thesaurus refinement Focusing on KB scientific interest (humanities) Re-structuring thesaurus: more hierarchical links 19769 top terms for GTT!

39 Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain

40 Manuscripts, 1st Collection: KB Illustrated Manuscripts

41

42 Manuscripts, 2nd Collection: BNF Mandragore

43

44 Manuscripts vocabularies Mandragore Huge (16000 terms) Weakly structured (2-level deep, multi-inheritance) Alternative lexical forms Definitions IconClass Huge (>24000 subjects) Richly structured : 10 level hierarchy, cross-references Compound concepts: keys, structural digits… Keywords

45

46 Manuscripts, Aim Integrated access All illuminations via Mandragore vocabulary All illuminations via Iconclass vocabulary

47 Summary Global presentation of project and past work Cases the project is currently focusing on KB internal case Illuminated Manuscripts from KB and BNF Iconclass Scientific problems STITCH between scientific research and CH domain

48 Iconclass Iconclass contains complex information Links between normal subjects and possible qualifiers Compound concepts Local extensions Existing representation is mostly text-based Aims Building a Semantic Web-enabled complete representation Dedicated ontology Conversion process implemented -> 1.2 M RDF triples Providing this representation as a (web) service As well as a standard SKOS version

49 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain

50 Steps Gathering vocabulary and collection data Analyzing it Transforming it using SW standards All record/vocabulary information in one repository

51 Conversion into RDF specific model and SKOS
BNF export tag | Interpretation | Conversion to record-specific RDFS model | SKOS interpretation
- | Descriptor | about=namespace+'d_'+@id | Concept
@id | id | md:descriptor-id | -
- | - | inScheme #descripteursScheme | inScheme #mandragoreScheme
libelle | label | md:hasPreferredLabel | prefLabel xml:lang="fr"
description | Definition note | md:hasDefinition | definition
formes-rejetees/forme-rejetee/libelle (+ optional /description) | Rejected label (and optional description) | md:hasRejectedForm, which points at an anonymous [RejectedForm] resource with hasRejectedLabel and hasRejectedLabelDefinition, whose textual values are set to the content of the libelle and description elements respectively (using rdf:parseType=Resource) | altLabel xml:lang="fr" + definition
notes | Complementary definition | md:hasNote | note
codes-dewey/code-dewey | Thematic classification (given by the DDC code attached to a classification element) | md:hasThematicClassification | broader
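The SKOS column of this mapping can be sketched as a small converter. This is an assumption-laden illustration rather than the project's conversion script: the element names (`libelle`, `description`, `formes-rejetees/forme-rejetee`, `notes`, `codes-dewey/code-dewey`) come from the slide, but the enclosing `descripteur` element, the exact nesting, the `http://example.org/mandragore/` namespace, the `ddc_` URI prefix for classification targets and the sample record are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/mandragore/"  # hypothetical namespace


def descriptor_to_skos(elem, lang="fr"):
    """Convert one exported descriptor element into SKOS-style triples."""
    uri = NS + "d_" + elem.get("id")
    triples = [
        (uri, "rdf:type", "skos:Concept"),
        (uri, "skos:inScheme", NS + "mandragoreScheme"),
    ]
    label = elem.findtext("libelle")
    if label:
        triples.append((uri, "skos:prefLabel", (label.strip(), lang)))
    definition = elem.findtext("description")
    if definition:
        triples.append((uri, "skos:definition", (definition.strip(), lang)))
    # Rejected forms become alternative labels (plus an optional definition).
    for rejected in elem.findall("formes-rejetees/forme-rejetee"):
        alt = rejected.findtext("libelle")
        if alt:
            triples.append((uri, "skos:altLabel", (alt.strip(), lang)))
        alt_definition = rejected.findtext("description")
        if alt_definition:
            triples.append((uri, "skos:definition", (alt_definition.strip(), lang)))
    note = elem.findtext("notes")
    if note:
        triples.append((uri, "skos:note", (note.strip(), lang)))
    # Dewey codes are read as links to a broader classification node.
    for code in elem.findall("codes-dewey/code-dewey"):
        if code.text:
            triples.append((uri, "skos:broader", NS + "ddc_" + code.text.strip()))
    return triples


# Hypothetical record; the real export format may differ.
SAMPLE = """
<descripteur id="42">
  <libelle>cheval</libelle>
  <description>animal domestique</description>
  <formes-rejetees>
    <forme-rejetee><libelle>jument</libelle></forme-rejetee>
  </formes-rejetees>
  <notes>voir aussi cavalier</notes>
  <codes-dewey><code-dewey>636</code-dewey></codes-dewey>
</descripteur>
"""

skos_triples = descriptor_to_skos(ET.fromstring(SAMPLE))
for triple in skos_triples:
    print(triple)
```

The sketch flattens a rejected form into an `altLabel`, which loses the link between that label and its own description. That loss is one reason for also keeping a record-specific RDFS model with an anonymous rejected-form resource.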

52 Transformation into SKOS Example grégoire 11 pierre roger de beaufort cardinal diacre de sainte-marie-nouvelle, pape Conversion of thesauri main features Preferred and alternative labels Semantic relationships (BT, RT) Notes (scope notes, definitions)

53 Collection Formalization Problems Interpreting and representing vocabularies using formal standards is hindered by expressivity variation Complex models Non-standard features Fuzzy structures, weakly structured Some information is lost when converting to SKOS Qualifiers Compound concepts Relation between terms (not only concepts) We kept complete models Adhoc ontologies, cf. Iconclass

54 Collection Formalization Problems System-specific conversions were done Depending on application environment Standard RDFS expressivity and implemented tools Depending on the mapping tools, which might make different hypotheses on the nature of knowledge to align OWL classes vs. nodes in trees

55 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain

56 Steps Provide mappers with vocabulary data Proceed to evaluation/selection of their results Put the alignment in the repository

57 Lessons learned: Collection Integration We have ontology mappers, not thesaurus mappers Input: pre-processing to pure RDFS/OWL ontologies Mapping process Using resources that may be absent from CH vocabularies Rich formal/structural information Not (properly) using all information found in CH vocabularies E.g. rich lexical information Output: needs re-interpretation of mapping relations

58 Automatic Ontology Matching Techniques Generally aiming at recognizing equivalence or subsumption links between ontology elements Lexical Labels of entities, textual definitions Structural Structure of the formal definitions of entities, position in the hierarchy Statistical Objects, instantiation of the concepts Shared background knowledge (“oracles”) Using conceptual references to deduce correspondences Most mapping tools use a mix of such approaches E.g. lexical string matching can trigger a structural alignment process

59 Alignment: lessons learned from previous experiments Lexical approaches Should use everything in a thesaurus Structural approaches Useless (even harmful) in a context where hierarchical information is weak Using background knowledge Needs to find proper resource/dictionary (Wordnet) Statistical approaches Needs dually classified data

60 Alignment: here Lexical approaches Should use everything in a thesaurus Structural approaches Useless (even harmful) in a context where hierarchical information is weak Using background knowledge Needs to find proper resource/dictionary (Wordnet) Statistical approaches Needs dually classified data

61 Lexical alignment: Manuscripts case [Monolingual case, since IC comes in French] Basic label comparison Preferred labels Alternative labels Going beyond labels Labels and definitions IC keywords and Mandragore labels Lexical information as bags-of-words Words in IC (glossy labels) found in Mandragore labels, and vice versa Words in Mandragore definitions found in IC labels
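The bag-of-words comparison can be sketched as a containment test: a source term matches a target term when every content word of the source text also occurs in the target text. This is one plausible reading of the slide, not the project's matcher. The stop-word list, the function names and the sample entries are assumptions, and no lemmatization is applied.

```python
import re

# Tiny illustrative French stop-word list (an assumption, far from complete).
STOPWORDS = {"de", "du", "des", "la", "le", "les", "et", "en", "un", "une"}


def bag(text):
    """Return the set of lower-cased content words in a text."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def containment_matches(source, target):
    """Pair each source entry with the target entries containing all its words.

    source / target map a concept id to a label, definition or keyword string.
    """
    target_bags = {tid: bag(text) for tid, text in target.items()}
    matches = []
    for sid, text in source.items():
        source_bag = bag(text)
        if not source_bag:
            continue  # nothing left to compare after stop-word removal
        for tid, target_bag in target_bags.items():
            if source_bag <= target_bag:
                matches.append((sid, tid))
    return matches


# Hypothetical entries; real Mandragore and Iconclass data are far larger.
mandragore_labels = {"mg:1": "cheval"}
iconclass_labels = {"ic:a": "le cheval et le cavalier", "ic:b": "chien"}

matches = containment_matches(mandragore_labels, iconclass_labels)
print(matches)
```

Containment of a short label in a longer one suggests a broader or associative link rather than equivalence, so the type of link asserted should depend on which fields were compared.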

62 Lexical alignment: Manuscripts case Between 430 and 21000 matches found Some redundant Some comparisons bringing quite some noise Interestingly, the results show a gradation Interesting coverage for the application 12300 Mandragore terms accessible from an IC term 22800 IC terms accessible from a MG term Fuzziness of original hierarchies allows for (associative) noise Problems: Better NLP treatments (e.g. lemmatization) Choice of proper alignment link depending on the features compared

63 Lexical alignment: Manuscripts case broader Equivalent

64 Demo Corn

65

66

67

68 Statistical approach: KB case

69 Comparing documents indexed with BK concepts and documents indexed with GTT concepts Overlap measure
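The slide does not say which overlap measure is used. The sketch below assumes the Jaccard coefficient over the sets of documents indexed with each concept, which is only one possible choice. The threshold, the minimum-usage filter, the function names and the sample data are likewise assumptions; the approach presupposes that the same documents carry index terms from both vocabularies.

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap between two sets of document ids."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def overlap_candidates(index_a, index_b, threshold=0.5, min_docs=1):
    """Return (concept_a, concept_b, score) for pairs scoring at least `threshold`.

    index_a / index_b map a concept id to the set of documents indexed with it.
    Concepts used for fewer than `min_docs` documents are skipped, because an
    overlap computed on a handful of documents is hardly significant.
    """
    results = []
    for a, docs_a in index_a.items():
        if len(docs_a) < min_docs:
            continue
        for b, docs_b in index_b.items():
            if len(docs_b) < min_docs:
                continue
            score = jaccard(docs_a, docs_b)
            if score >= threshold:
                results.append((a, b, score))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical dually indexed documents (ids 1..9).
index_a = {"a:x": {1, 2, 3, 4}}
index_b = {"b:y": {2, 3, 4, 5}, "b:z": {9}}

pairs = overlap_candidates(index_a, index_b, threshold=0.5, min_docs=2)
print(pairs)
```

With these toy sets, the first pair shares three of five documents, while the concept used only once is filtered out before any score is computed.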

70 Statistical approach: problems Finding threshold to filter results Taking into account thesaurus use Levels of indexing are different Statistical significance is not granted Overlap measure is less significant when concepts are used only a few times

71 Using background knowledge Interesting research BK brings additional structural semantics to concepts BK brings more lexical knowledge (synonyms) in the loop Problem: needs to find proper resource/dictionary Domain-specific vocabularies Language-specific vocabularies

72 First experiments on anchoring GTT to Wordnet [with Véronique Malaisé, CHOICE] Setting Using an online Dutch-English dictionary Comparing translations (and definitions) found with Wordnet content Nice feature: many GTT are already manually translated Results Poor recall: 9% of concepts for which there was a manual translation were anchored to WN Problems: Encoding Domain-specificity Complex terms Results are better with another vocabulary

73 Other Dutch vocabularies hanging around

74 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems Solving representation heterogeneity Solving conceptual heterogeneity Evaluation STITCH between scientific research and CH domain

75 A transversal problem: evaluation Assessing quality of mapping In a specific context Taking into account Use of thesauri (indexing levels) Integration aim (hierarchical browsing) Designing evaluation tools Methods to evaluate samples to guide mapping process at a low cost And yet have statistical relevance
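One low-cost way to evaluate samples while keeping some statistical footing is to judge a random sample of mappings manually and report precision with a confidence interval. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not a method named in the presentation; the function names, the fixed seed and the sample figures are hypothetical.

```python
import math
import random


def draw_sample(mappings, size, seed=0):
    """Draw a reproducible random sample of mappings for manual judgment."""
    rng = random.Random(seed)
    return rng.sample(mappings, min(size, len(mappings)))


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical figures: 1000 candidate mappings, 100 judged, 80 judged correct.
all_mappings = [("a:%d" % i, "b:%d" % i) for i in range(1000)]
sample = draw_sample(all_mappings, 100)
low, high = wilson_interval(correct=80, total=100)
print(len(sample), round(low, 3), round(high, 3))
```

The width of the interval shows directly how many judgments are needed before two mapping methods can be told apart, which helps keep the manual effort small.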

76 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community

77 Pushing Research Results to the CH World Paper publications European Conference on Digital Libraries 2006 Informatie Professional 2006 Dissemination papers on SKOS and OWL Talks, demonstrations done and planned Digital Erfgoed Conference BNF KB RNA demo middag UDC seminar Lecture for Masters on Book & Digital Media (Leiden) SKOS tutorial @ CATCH day

78 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community

79 Concrete Collaborations Collaborations with CH institutes Digitaal Erfgoed Nederland Creation of a thesaurus inventory questionnaire KB experts Illuminated Manuscripts Operational departments BNF Illuminated Manuscripts Rijksbureau Kunsthistorische Documentatie Iconclass [Illuminare (Leuven)]

80 Pushing Research Results to the CH World CH-oriented research projects The European Library Research proposal on multilingual thesaurus alignment CATCH Rijksmuseum collections (CHIP) Anchoring GTT to Wordnet (CHOICE) Metadata Recommendation (CHOICE, MITCH) Iconclass and GTAA Service (CHOICE) Mapping and vocabulary repository (CHOICE) RNA

81 Summary Global presentation of project and past work Cases the project is currently focusing on Scientific problems STITCH between scientific research and CH domain Pushing research results to the CH world Concrete collaborations Bringing domain problems to the research community

82 Confrontation of existing SW tools with real CH data VU talks and collaborations External collaboration (Trento) Papers BNAIC SWI-Prolog and the Web (Theory and Practice of Logic Programming) Participation in W3C Semantic Web Deployment working group Editor of SKOS use cases and requirements document Contribution of Manuscript and Iconclass use cases

83 Free discussion

