Aug. 14, 2012. 2012 IASLOD. Linking Korean Resources to LOD: Issues in Localization. Mun Y. Yi.

Presentation transcript:

Aug. 14, IASLOD Linking Korean Resources to LOD: Issues in Localization Mun Y. Yi

- 1 - Agenda: Project Scope; System Architecture; Silk in Action; Korean Traditional Knowledge Data; Localization Issues

- 2 - LOD2 Work Packages
The project is structured into twelve consecutively numbered work packages (WPs). WP1 to WP6 are concerned with the development of the LOD2 Stack. WP7 to WP9 are designed to validate and demonstrate the developed technology extensively, using a carefully selected and representative set of demonstrator applications with potentially great impact. WP10 (SWC) is devoted to training, awareness, and dissemination. WP11 is concerned with exploitation and standardization activities, as well as technical coordination with other projects. WP12 covers high-level project coordination, reporting to the EC, and activities related to the resolution of IPR issues and the maintenance of the Consortium Agreement.

- 3 - Simplified LOD2 Stack High-Level Architecture
The main result of LOD2 will be the LOD2 Stack, an integrated distribution of aligned tools that support the whole life cycle of Linked Data, from creation through enrichment, interlinking, and fusion to maintenance.

- 4 - Project Scope: Tasks & Deliverables
In Task 4.1, a semi-automatic machine learning technique will be developed and implemented to simplify the creation of mappings between knowledge bases and the assessment of their quality. KAIST will contribute to this task by providing a platform for automatic linking with Korean, Chinese, and Japanese RDF resources.
Task 4.1 Semi-Automatic Data Interlinking: University Leipzig, Digital Enterprise Research Institute, Free University Berlin, KAIST
Deliverable: First Linking Assist Release. Due date: M18
Deliverable: Korean Resource Linking Assist Release. Due date: M24
Deliverable: Asian Resource Linking Assist Release. Due date: M30

- 5 - Project Scope: Tasks & Deliverables (Cont’d)
In Task 4.5, methods for fusing data about a single concept from multiple different sources will be devised and implemented. KAIST will work on the fusion of multilingual DBpedia datasets, thus eliminating issues for other multilingual resources.
Task 4.5 Link Data Fusion: University Leipzig, Digital Enterprise Research Institute, Free University Berlin, KAIST
Deliverable: Initial Release of Data Fusion Component. Due date: M24
Deliverable: Korean Data Fusion Assistant. Due date: M30
Deliverable: Asian Data Fusion Assistant. Due date: M36

- 6 - Phased Approaches
The project was carried out in three iterative cycles. Each cycle focuses on specific tasks, and lessons learned are carried into the next cycle. In the 1st cycle, preliminary RDF data was generated. During the 2nd cycle, we localized Silk to support Korean resource linking. The 3rd cycle focuses on enhancing data quality.
1st Cycle (~Feb. 2012): Understanding of the Task Domain; Semantic Web; LOD2 Concept; Software Architecture; Data Model (Relational2RDF); Pilot Project; Korean Traditional Recipe data
2nd Cycle (~July 2012): Implementation of Korean Resource Linking Assistant; Silk Localization; Linking with Silk Framework; Internal publication
3rd Cycle (~Aug. 2012): Quality Enhancement; Linking Quality; Publish to the LOD2 cloud

- 7 - Silk in Action
url:
A file or a SPARQL endpoint can serve as a source or a target.
Steps: define a project; define a source and a target; define a task; define an output; then click Open.

- 8 - Silk in Action (Cont’d)
Steps: define a source and a target from Property Paths; define operator(s); click GenerateLinks; click Start.
Multiple operators can be combined for complex tasks. Outputs can be displayed or written to a file. An interim result can be exported as a final result or used as a training data set for machine learning. The learned algorithm can then be used to generate the final links.

- 9 - Korean Traditional Knowledge Portal

Korean Traditional Knowledge Data includes
– Food (3,236 records): food name; food type; recipe, ingredients; cooking process (images)
– Medicine, sickness, and treatment (38,121 records)
– Agriculture (2,775 units)
– Life (4,438 units)

System Architecture
Source data in a relational DB passes through three stages:
Transformation: RDFgen*, for transforming the relational model to the RDF model (* proprietary)
Link creation: Silk, for link generation, extended with new Korean similarity measures
Publication: Virtuoso triple store, for serving RDF
Diagram labels: RDF, Links, Instances, Ontology, DBpedia

Key Linking Issues: Data Preprocessing; Address Encoding (URI vs. IRI); Korean String Similarity Measure; Handling Transliterated Data

Data Preprocessing: Mapping Relation to RDF
Our goal is to make the recipes of Korean traditional food open. The original data in the relational database were transformed into tables by object-relational mapping. Related ontology for recipes: LinkedRecipe.com; Tool and IngredientPortion are not implemented at this phase.
Relational -> RDF
Table name -> Class name
PK column value -> Subject
Non-PK column name -> Predicate
FK column value -> Object (used as URI; RDF link)
Non-FK column value -> Object (used as string; literal triple)
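
The mapping table above can be sketched in a few lines of Python. This is only an illustration of the table's rules, not the proprietary RDFgen tool: the `http://example.org/` namespace, the `row_to_triples` helper, and the sample row (including its column names) are all assumptions made for the example.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the slides
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def row_to_triples(table, pk_column, fk_targets, row):
    """Map one relational row to (subject, predicate, object) strings.

    table name          -> class
    PK column value     -> subject URI
    non-PK column name  -> predicate
    FK column value     -> object used as a URI (RDF link)
    non-FK column value -> object used as a string literal
    """
    subject = f"<{BASE}{table}/{quote(str(row[pk_column]))}>"
    triples = [(subject, RDF_TYPE, f"<{BASE}{table}>")]
    for column, value in row.items():
        if column == pk_column or value is None:
            continue
        predicate = f"<{BASE}{table}#{column}>"
        if column in fk_targets:
            # Foreign key: point at the subject URI of the referenced row.
            obj = f"<{BASE}{fk_targets[column]}/{quote(str(value))}>"
        else:
            # Plain value: emit a quoted literal with minimal escaping.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        triples.append((subject, predicate, obj))
    return triples


# Hypothetical row from a hypothetical "Food" table.
row = {"id": 17, "name": "비빔밥", "type_id": 3}
triples = row_to_triples("Food", "id", {"type_id": "FoodType"}, row)
for triple in triples:
    print(" ".join(triple), ".")
```

Keeping the foreign-key case separate is what turns a join in the relational model into a navigable RDF link rather than an opaque number.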

Handling Non-Latin Data
Resources may be described in non-Latin characters. It is not known whether the tools support non-Latin characters.
Figure: Writing systems of the world today (Wikipedia)

Address Encoding
The URI is a core component of linked data: URIs are used as names for things. A URI allows only US-ASCII characters in the name of a resource.
W3C recommendation for URIs: UTF-8 character set and URI encoding. Use the UTF-8 character set for URIs, and encode special and non-Latin characters using %. ex) But it is hard to understand what it is…
Another W3C recommendation: IRI (Internationalized Resource Identifier). ex) 베를린 Now we can understand what it means. But some characters look so similar that the chance of spoofing increases. ex) Å
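
As a small illustration of the two points on this slide, the sketch below percent-encodes the Korean name from the slide with Python's standard library and then compares two look-alike characters. The specific look-alike pair (U+00C5 and U+212B) is an assumption chosen for the example; the slide's own pair is not fully legible.

```python
import unicodedata
from urllib.parse import quote

# IRI form: readable, but not allowed in a plain URI.
iri_segment = "베를린"  # "Berlin" in Korean, as on the slide

# URI form: UTF-8 bytes, each written as %XX.
uri_segment = quote(iri_segment, safe="")
print(uri_segment)  # %EB%B2%A0%EB%A5%BC%EB%A6%B0

# Spoofing risk: two code points that render almost identically.
latin_a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom_sign = "\u212B"   # ANGSTROM SIGN
looks_same_but_differs = latin_a_ring != angstrom_sign

# NFC normalization maps the angstrom sign onto the Latin letter,
# which is one common way to reduce this kind of ambiguity.
same_after_nfc = unicodedata.normalize("NFC", angstrom_sign) == latin_a_ring
print(looks_same_but_differs, same_after_nfc)
```

Normalization handles canonically equivalent characters only; look-alikes from different scripts would need a separate confusables check.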

Localization: Silk Workbench Address Encoding
Silk Workbench is a GUI for generating links. It displays encoded URIs as is, which makes a non-Latin dataset hard to read. Decoding the URIs lets a non-Latin dataset be displayed in its native script, so it is much easier to work with.
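
The decoding step can be sketched as follows. This is not the Silk Workbench code; `display_form` is a hypothetical helper, and the resource URI is an illustrative example rather than one taken from the slides.

```python
from urllib.parse import unquote


def display_form(uri):
    """Return a percent-encoded URI with UTF-8 escapes decoded for display."""
    return unquote(uri)  # unquote decodes %XX sequences as UTF-8 by default


# Illustrative encoded URI (assumed, not from the slides).
encoded = "http://ko.dbpedia.org/resource/%EB%B9%84%EB%B9%94%EB%B0%A5"
decoded = display_form(encoded)
print(decoded)  # http://ko.dbpedia.org/resource/비빔밥
```

The decoded form is meant for display only; the encoded form should still be used when dereferencing or comparing identifiers.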

Localization: Korean String Similarity Measures
Two kinds of Korean resources exist: resources in Korean and resources in transliterated Korean. We need to calculate similarity distances for both.
Resources in Korean: e.g., ‘비빔밥’ (Korean DBpedia); most of the resources in Korea.
Resources in transliterated Korean: e.g., ‘bibimbap’ (English DBpedia); most of the resources abroad.
The Korean alphabet has 14 consonants and 10 vowels (together with consonant clusters and diphthongs). Most of the comparators in Silk are based on string comparison, e.g., Levenshtein distance. However, writing systems differ from language to language, so are comparators designed for the Latin alphabet appropriate for the Korean alphabet?
String similarity distance measures for Korean: KorED; GrpSim; OneDSim2; KorPhoD (our approach) = (sD-1)*3 + min(pD), sD: Syllable Distance, pD:
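
To make the question concrete, the sketch below computes a plain Levenshtein distance twice: once over whole Hangul syllables, and once over the letters (jamo) that make up each syllable. It is not KorED, GrpSim, OneDSim2, or KorPhoD; it only shows why the unit of comparison matters. The decomposition relies on Unicode NFD normalization, which is an implementation choice made for this example, so the letter-level counts need not match the counts reported in the slides.

```python
import unicodedata


def levenshtein(a, b):
    """Classic edit distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def syllables(text):
    """One code point per Hangul syllable (precomposed form)."""
    return unicodedata.normalize("NFC", text)


def jamo(text):
    """Decompose each Hangul syllable into its conjoining jamo."""
    return unicodedata.normalize("NFD", text)


source, target = "녹차", "모과차"
syllable_distance = levenshtein(syllables(source), syllables(target))
jamo_distance = levenshtein(jamo(source), jamo(target))
print(syllable_distance, jamo_distance)
```

At the syllable level a single changed letter costs as much as replacing the whole syllable, which is the kind of coarseness a Korean-specific measure tries to correct.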

Localization: Korean String Similarity Measures (Cont’d)
Several Korean similarity distances exist that reflect the characteristics of the Korean alphabet. We devised a new measure based on the distribution of phonemes (KorPhoD). We implemented a KoreanPhonemeDistance operator in Silk and used it to build links among Korean resources.
Application of Edit Distance to Korean Resources
Source | Target | Levenshtein Distance | Actual Edit Operations | Differences in phonemes | Differences in syllables
녹차 | 모과차 | 2 | 3 (ㅁ->ㄴ, ㄱ-> add, 과-> delete) | 4 | 2
Comparison of Similarity Measures for Korean (columns: Source, Target, Levenshtein Distance, KorED, GrpSim, OneDSim2, KorPhoD; legend: syllable distance, phoneme distance)
우연히 | 망연이 | 2 | + + *ws (‘ㅇ’ and ‘ㅎ’ are similar) + *w3 +
강낭콩 | 뿔난콩 | 2 | + + *wd (‘ㅇ’ and ‘ㄴ’ are different) + *w4 +
일반통계학 | 일방통행 | 3 | + 3 + *wd+ *w+ *wd2 +
바람 | 보름 | 2 | 22 +
Performance Comparison
Precision: 1.28% vs. % (about a thirteenfold improvement)
F-score: vs. (four times more effective at finding correct links)

Localization: Transliterated Korean Similarity Measures
Two kinds of transliteration involve Korean: from English to Korean and from Korean to English. For now, we focus on transliteration from Korean to English in order to build links for resources in Korean. The biggest problem is that many different algorithms for transliterating Korean into English have been in use.
From English to Korean: ‘Digital’ -> ‘디지털’, ‘디지틀’, ‘디지탈’, …
From Korean to English: ‘칼국수’ -> ‘Kalguksu’, ‘Kalguksoo’, ‘Kalgugsoo’, …
Transliteration algorithms for Korean:
McCune-Reischauer (1937): the official standard in the past (from 1984 to 2000). Uses breves (˘: indicates a short vowel), apostrophes, and diereses (¨: a vowel is sounded in a separate syllable).
Yale (1942)
Revised Romanization (2000): the current official standard. Generally similar to MR, but uses no diacritics or apostrophes, and uses distinct letters for ㅌ/ㄷ (t/d), ㅋ/ㄱ (k/g), ㅊ/ㅈ (ch/j), ㅍ/ㅂ (p/b), etc.
and probably many more…
We found that many academic and government websites still mostly use MR. Silk does not have phonetic similarity measures, though (e.g., Soundex).
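
A phonetic comparator of the kind mentioned here can be sketched with standard American Soundex. This is neither a Silk operator nor the KoTlit measure; it is a minimal illustration of why a phonetic code tolerates the spelling variants listed on the slide. The apostrophe spelling in the last call is an assumed MR-style variant added for the example.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    # Keep ASCII letters only, so apostrophes and other marks are ignored.
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


variants = ["Kalguksu", "Kalguksoo", "Kalgugsoo"]
codes = [soundex(v) for v in variants]
print(codes)  # ['K422', 'K422', 'K422']
print(soundex("k'alguksu"))  # assumed MR-style spelling; also K422
```

Because all three variants collapse to one code, a Soundex comparator favors recall, at the price of also matching unrelated names that happen to share a code.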

Localization: Transliterated Korean Similarity Measures (Cont’d)
We compare performance from both a string-similarity perspective and a phonetic-similarity perspective. Levenshtein shows good precision, and Soundex shows good recall. KoTlit shows good precision and recall, and we are still optimizing the algorithms.
Performance Comparison (M.R.): Relevant, Retrieved, Ret. & Rel., Precision (%), and Recall (%) for Levenshtein*, Soundex, and KoTlit. * threshold: 0
Performance Comparison (R.R.): Relevant, Retrieved, Ret. & Rel., Precision (%), and Recall (%) for Levenshtein*, Soundex, and KoTlit. * threshold: 0
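
The table columns relate to each other through the usual definitions of precision, recall, and F-score, sketched below. The function name and the input counts are hypothetical and are not results from the presentation.

```python
def precision_recall_f1(relevant, retrieved, retrieved_and_relevant):
    """Compute precision, recall, and F1 from the three table counts.

    relevant               -- number of correct links that exist
    retrieved              -- number of links the comparator produced
    retrieved_and_relevant -- number of produced links that are correct
    """
    precision = retrieved_and_relevant / retrieved if retrieved else 0.0
    recall = retrieved_and_relevant / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(relevant=100, retrieved=80,
                              retrieved_and_relevant=60)
print(f"precision={p:.2%} recall={r:.2%} f1={f:.3f}")
```

A comparator that retrieves few links can score high on precision and low on recall, and the reverse holds for a permissive one, which is why the slide reports both.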

Concluding Remarks
Localization issues are important for Asian and other non-Latin countries.
Each language needs its own similarity measures, for both string similarity and phonetic similarity.
Silk is likely to become a key linking assistant program for LOD.
LOD is a major movement to define the next version of the Internet.

Thank you! Mun Yong Yi, KAIST Department of Knowledge Service Engineering. mail: