2018/4/14
SMC4LRT
Semantic Mapping Component for Language Resources and Technology
2011-06-06
Matej Ďurčo, ICLTT, Vienna

Context on Language Resource and Technology
- CLARIN – Common Language Resources and Technology Infrastructure
  - CMDI - CLARIN Metadata Infrastructure
  - heterogeneous collection of (Metadata about) Resources
- ISOcat (ISO 12620) - a framework within ISO TC 37 for defining:
  - Data Categories – Definitions of widely accepted linguistic concepts
- apply Semantic Technologies
  - Ontology Mapping / Schema Mapping
  - Ontology Browsing / Visualization
  - Linked Open Data

Main Goal/s
- Enhance Metadata Search → Semantic Search
  Basic Idea: query + relations (#DatCat) = expanded query
    query:          Actor.Name any Peter
    relations:      #sameAs (#Actor, #Person)
                    #sameAs (#Name, #FullName)
    expanded query: Actor.Name any Peter OR Actor.FullName any Peter OR
                    Person.Name any Peter OR Person.FullName any Peter
- (Class level) Semantic Browsing - Browse Metadata/Resources via ontologies (LT-World)
- (Instance-Level) Interoperability / Reuse - Connect dataset to Linked Open Data

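The query expansion from the Basic Idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `equivalence_classes` and `expand_query` are hypothetical, the `#` prefix of the data category names is dropped, and each step of the dotted index is simply replaced by every name in its `sameAs` class.

```python
from itertools import product


def equivalence_classes(pairs):
    """Group names connected by sameAs pairs (treated as symmetric and transitive)."""
    classes = {}
    for a, b in pairs:
        merged = classes.get(a, {a}) | classes.get(b, {b})
        for name in merged:
            classes[name] = merged
    return classes


def expand_query(index, relation, term, same_as):
    """Expand a single search clause into an OR of all equivalent clauses."""
    classes = equivalence_classes(same_as)
    # For each step of the dotted index: the original name first, then its equivalents.
    alternatives = [
        [step] + sorted(classes.get(step, {step}) - {step})
        for step in index.split(".")
    ]
    clauses = [
        f"{'.'.join(combo)} {relation} {term}" for combo in product(*alternatives)
    ]
    return " OR ".join(clauses)


# Relations taken from the slide: #sameAs(#Actor, #Person), #sameAs(#Name, #FullName)
same_as = [("Actor", "Person"), ("Name", "FullName")]
expanded = expand_query("Actor.Name", "any", "Peter", same_as)
print(expanded)
```

With the two relations from the slide, the sketch produces the same four-clause expanded query as shown above.
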
Definitions
- Vocabulary, Lexicon, Ontology
- Term, Category, Concept ?
- MD Profile / Schema
- MD Description

Components
- DataCategoryRegistry - ISOcat DCR (ISO/TC37)
  define/standardize a reusable set of (basic) data categories
- CMDI - ComponentRegistry
  define profiles/schemas at will, but reference DatCats!
- CMDRSB - Repository/Service/Browser
  CMDI exploitation-side trinity
  http://clarin.aac.ac.at/MDService2/
- RelationRegistry
  allows defining relations between DatCats
- VLO - Virtual Language Observatory
  faceted browser for CLARIN Metadata, maps all heterogeneous information from all profiles to 10 facets!
- VAS – Vocabulary Alignment Service (CATCHPlus.nl)
  find concept to literal, find aligned concepts
- LT-World - Domain ontology

Components - CMDI

Components - dependencies

Approach – Class/Concept level
Use linkage: Profiles → Data Categories ← Relation Registry
- just mapping based on the ConceptLink
  resolvable via ComponentRegistry: different Profile/Elements pointing to the same DatCat
- use Information from Relation Registry:
    a) equivalence relation between DatCats
    b) equivalence relation also between Component DatCats (yet to come)
    c) use also other relations in Relation Registry (subClassOf, synonymy?, …)
- Apply selected (user-defined) relation-sets from Relation Registry

MDRecord
    <CMD>
      <Header>
        <MdProfile>{profileID}</MdProfile>
      <Components>
        <{profileName}>
          <{component}>
            <{element}>

CMD-Profile-Specification
    <CMD_ComponentSpec>
      <Header><ID>{profileID}</ID>...</Header>
      <CMD_Component name="{profileName}">
        <CMD_Component name="{component}">
          <CMD_Element name="{element}" ConceptLink="{datcat-uri}">

Data Category Registry
    <dcif:dataCategorySelection>
      <dcif:dataCategory pid="{datcat-uri}">
        {detail-information}

Relation Registry
    <rdf:RDF>
      <rdf:Description rdf:about="{datcatX-uri}">
        <sameAs rdf:resource="{datcatY-uri}"/>
      </rdf:Description>

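The linkage Profiles → Data Categories ← Relation Registry can be illustrated with a small Python sketch. It assumes the ConceptLinks have already been extracted from the profile specifications into a dictionary and the equivalence relations from the Relation Registry into a list of pairs. All element paths and `datcat:` identifiers below are made-up examples, and the union-find grouping is one possible technique, not necessarily the one used in the component.

```python
from collections import defaultdict

# Hypothetical extract: element path → ConceptLink (data category identifier)
concept_links = {
    "ProfileA.Actor.Name": "datcat:name",
    "ProfileB.Person.FullName": "datcat:fullName",
    "ProfileC.Contact.Name": "datcat:name",
    "ProfileA.Session.Title": "datcat:title",
}

# Hypothetical extract of the Relation Registry: sameAs pairs between DatCats
same_as = [("datcat:name", "datcat:fullName")]


def find(parent, x):
    """Return the representative of x's equivalence class (union-find with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def equivalent_elements(concept_links, same_as):
    """Group element paths whose data categories are identical or declared equivalent."""
    parent = {}
    for a, b in same_as:
        parent[find(parent, a)] = find(parent, b)
    groups = defaultdict(list)
    for path, datcat in concept_links.items():
        groups[find(parent, datcat)].append(path)
    return {root: sorted(paths) for root, paths in groups.items()}


groups = equivalent_elements(concept_links, same_as)
for root, paths in groups.items():
    print(root, "→", paths)
```

Elements pointing to the same DatCat end up in one group even without any relation; the `sameAs` pair additionally merges the groups of the two related data categories.
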
Approach – Individuals/Instance Level
Property-values of Metadata-Records are linked to instances of domain-ontologies
One step when (pre)processing incoming new MD-sets:
- Express MD-Records as RDF-triples:
    <#mdrecord #property "string-value">
- Identify potential target Domain Ontologies/Vocabularies
- Create inverted Index:
    label → entity
- Define lookup function:
    lookup(category, string-value) → <external-entity, measure>
- Enrich dataset with new facts:
    <#mdrecord #property #external-entity>

Category          Label                     Entity
dc:Organization   "MPI", "Max-Planck..."    #MPI
                  "DFKI", "De Fo Kü In"     #DFKI
skos:LCSH         "19th Poetry"             lcsh:19thPoetry
skos:DDC                                    ddc:19thPoetry

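A minimal sketch of the inverted index and the lookup function, using rows from the example table. Several things here are assumptions rather than part of the approach as stated: the function names, case-insensitive matching, the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold of 0.8. The elided label "Max-Planck..." is left out because its full form is not given.

```python
from difflib import SequenceMatcher

# (category, label, entity) rows taken from the example table
rows = [
    ("dc:Organization", "MPI", "#MPI"),
    ("dc:Organization", "DFKI", "#DFKI"),
    ("dc:Organization", "De Fo Kü In", "#DFKI"),
    ("skos:LCSH", "19th Poetry", "lcsh:19thPoetry"),
]


def build_index(rows):
    """Inverted index: category → {case-folded label → entity}."""
    index = {}
    for category, label, entity in rows:
        index.setdefault(category, {})[label.casefold()] = entity
    return index


def lookup(index, category, value, threshold=0.8):
    """Return (external-entity, measure) for the best matching label, or None."""
    best = None
    for label, entity in index.get(category, {}).items():
        measure = SequenceMatcher(None, value.casefold(), label).ratio()
        if measure >= threshold and (best is None or measure > best[1]):
            best = (entity, measure)
    return best


index = build_index(rows)
match = lookup(index, "dc:Organization", "mpi")
print(match)

# Enrichment step: replace the string value by the matched entity, if any.
if match is not None:
    enriched = ("#mdrecord", "#property", match[0])
    print(enriched)
```

Returning the measure together with the entity keeps the decision about what counts as a good enough match with the caller, which matters once fuzzy matches against large vocabularies are involved.
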
Semantic Mapping - Linking and Data Flow
INCONSISTENT

Semantic Search - Query sequence

Candidate Categories/Properties
- ResourceType, Format, AnnotationLevelType
  → map to: isocat-DataCategories (Thematic Views: Metadata, Morphosyntax, ...)
- Genre, Topic, Subject
  → map to: Taxonomies, Library Classification systems (LCSH, DDC, Dornseiff, ...)
- Project, Institution, Person, Publisher
  open controlled vocabularies (real entities)
  → map to: LT-World (perhaps others: LCCN, DBPedia?)

Expected Results
- Specification + Prototype of a Semantic Mapping Component
  allowing CMD-Metadata to be transformed into RDF
- Specification + Prototype of a Semantic Search Component
  REST web service enriching the MD-Search, allowing query expansion and ontology/concept-based search
- CLARIN Metadata expressed as RDF/LOD-Dataset

Next Steps
- Literature → Related Work
  - Linked Open Data
  - Ontology Mapping
  - Ontology Browsing/Visualization
- Analyze Data
  - Existing MD-Schemas (DC, OLAC, MODS, TEI, IMDI, CMD, ...)
  - LT-World Ontology
  - SKOS-Data available via Vocabulary Alignment Service
  - LCSH, LCCN
  - DBPedia

Tasks / Open Issues (Who/How)
- Define Concept-Level Relations
  (Vocabulary Service http://catchplus.tuxic.nl/catchplus/serviceapi/1/)
- Populate Vocabulary service
  translate Ontologies, Taxonomies
- Express MDRepo in RDF
  - every profile is one Ontology
  - every MDRecord is an instance
- Ontology Mapping
  (compute similarities between profiles and between instances)

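Expressing a metadata record in RDF could look roughly like the following sketch: the record becomes an instance typed by its profile, and every leaf element becomes one triple with a string literal. The namespace `http://example.org/cmd/`, the URI layout, the dotted property names and the helper names are all hypothetical; only the `rdf:type` URI is a real, standard identifier. The output is serialised as N-Triples.

```python
from typing import NamedTuple

BASE = "http://example.org/cmd/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    literal: bool  # True if obj is a string literal, False if it is a URI


def record_to_triples(record_id, profile, components):
    """Turn one MD record (nested dict of components/elements) into triples."""
    subject = f"{BASE}record/{record_id}"
    profile_uri = f"{BASE}profile/{profile}"
    # The record is an instance of its profile.
    triples = [Triple(subject, RDF_TYPE, profile_uri, False)]

    def walk(path, node):
        for key, value in node.items():
            if isinstance(value, dict):
                walk(path + [key], value)
            else:
                prop = f"{profile_uri}#{'.'.join(path + [key])}"
                triples.append(Triple(subject, prop, str(value), True))

    walk([], components)
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples lines, escaping string literals."""
    lines = []
    for t in triples:
        if t.literal:
            escaped = (
                t.obj.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
            )
            obj = f'"{escaped}"'
        else:
            obj = f"<{t.obj}>"
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return "\n".join(lines)


triples = record_to_triples("r1", "Session", {"Actor": {"Name": "Peter"}})
print(to_ntriples(triples))
```
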
Questions/Discussion
- Distinguish between relations (is it type vs. subclass?)
  - ISA, a-kind-of = type
  - subsumption (hypo/hyperonymy) = subClassOf
- Resource-Level:
  - Annotation-Tiers of Resources are conceptLinked to DatCats
  - Values of Annotation-Tiers are linked to DatCats
  - Thierry: the user is rather a computational linguist within an application (relevant in META-NET)
- How to employ Linguistic Ontologies? Lemon/LingInfo, isocat, GOLD, wals.info, Wordnet?
  - Thierry: shouldn't be necessary, mainly for OntoPopulation from texts

MDService - Basics
MDService accepts queries about metadata from the MetadataBrowser (and external applications) and passes them to the Metadata Repository(ies) and/or to the Virtual Collection Registry, optionally applying Semantic Mapping based on the information from the Component Registry, Data Category Registries and Relation Registry. It then receives the results and passes them (optionally formatted) back to the requesting node.

MDService - Functionality
REST-interface (trac:WADL, MDService2/docs/htmlpage/wadl)
- collections: list the "natural" hierarchical collections-structure of the repository
- model: return xml-elems used in the repository (with usage statistics)
- terms: return terms/indices/xml-elems used in the repository, enriched with
    a) the usage statistics (count occurrences and distinct values)
    b) the corresponding CMD-components and data categories
- values: list distinct values for given index (similar to facet functionality)
- recordset: retrieve a list of MDrecords based on a query [CQL]
- record: retrieve individual MDrecord based on the identifier

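To make the terms and values operations more concrete, the following sketch computes the same kind of information over a tiny in-memory set of records. It does not call MDService and says nothing about its actual request or response format; the record layout, the sample data and the function names are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records: index (element path) → list of values
records = [
    {"Actor.Name": ["Peter", "Anna"], "Session.Title": ["Interview 1"]},
    {"Actor.Name": ["Peter"], "Session.Title": ["Interview 2"]},
]


def terms(records):
    """Per index: number of occurrences and number of distinct values."""
    counts = defaultdict(Counter)
    for record in records:
        for index, vals in record.items():
            counts[index].update(vals)
    return {
        index: {"occurrences": sum(c.values()), "distinct": len(c)}
        for index, c in counts.items()
    }


def values(records, index):
    """Distinct values of one index with their frequencies, most frequent first."""
    counter = Counter()
    for record in records:
        counter.update(record.get(index, []))
    return counter.most_common()


term_stats = terms(records)
actor_values = values(records, "Actor.Name")
print(term_stats)
print(actor_values)
```
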
MDBrowser - Functionality
http://clarin.aac.ac.at/MDService2/docs/htmlpage/info
- Dynamic Repositories
- Collections browsing
- Terms/Values browsing
- Query Input
  - Simple full-text query
  - Complex queries (CQL-searchclauses, boolean op)
  - Index auto-completion
- Queryset/Resultset
  - work with multiple results in parallel
  - Paging
  - Variable views (select columns, auto-columns)
- Workspace (storing queries, bookmarks)
- "Linkable" Queries
- (Semantic Mapping)

CMDRSB - Situation and Outlook
- The MDRepository currently contains around 109,000 records, mainly from the datasets OLAC and IMDI (overview of collections).
- Currently three instances of the MDRepository are running, providing similar but not identical datasets (they can all be accessed via the same MDService, by switching the target repository in the UI):
  - University of Gothenburg (main)
  - ICLTT, Vienna
  - MPI Psycholing, Nijmegen
- A first version of the MDService and Browser is online: clarin.aac.ac.at/MDService2
- Although the repository and interface already provide a lot of information and functionality, the system is demo-quality and cannot yet be regarded as a reliable service.
- A lot of work is still needed on both data quality and the user interface:
  - enhancing the UI (based on feedback from Nijmegen 2011-01, 2011-05)
  - continuous integration of new datasets (provided for harvest by the centres)
- Nevertheless we invite you to try it out and look forward to any critical remarks.