Incremental Export of Relational Database Contents into RDF Graphs
Nikolaos Konstantinou, Dimitris Kouis, Nikolas Mitrou

Incremental Export of Relational Database Contents into RDF Graphs
Nikolaos Konstantinou, Dimitris Kouis, Nikolas Mitrou
Presented by Dr. Nikolaos Konstantinou
National Technical University of Athens, School of Electrical and Computer Engineering, Multimedia, Communications & Web Technologies
4th International Conference on Web Intelligence, Mining and Semantics

Outline
- Introduction
- Proposed Approach and Measurements
- Discussion and Conclusions

Introduction
- Information is not always collected, maintained, and updated directly in a triplestore; this often happens in an RDBMS
- Triplestores are often kept as an alternative content delivery channel
- It can be difficult to change established methodologies and systems
- Newer technologies need to operate side by side with existing ones before migration

Mapping Relational Data to RDF
- No one-size-fits-all approach
- Synchronous vs. asynchronous RDF views
- Real-time vs. ad hoc RDF views
- Real-time SPARQL-to-SQL vs. querying the RDF dump using SPARQL
  - Queries on the RDF dump are faster under certain conditions, compared to round trips to the database
  - The difference in performance is more visible when SPARQL queries involve numerous triple patterns (which translate to expensive JOIN statements)
- In this paper, we focus on the asynchronous approach
  - Exporting (dumping) relational database contents into an RDF graph

Incremental Export into RDF (1/2)
- Problem
  - Avoid dumping the whole database contents every time
  - When only a small part of the data changes in the source database, it is not necessary to dump the entire database
- Approach
  - Every time the RDF export is materialized
    - Detect the changes in the source database or the mapping definition
    - Insert/delete/update only the necessary triples, in order to reflect these changes in the resulting RDF graph
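
The goal stated on this slide, touching only the triples that need to change, can be illustrated with a plain set difference between two exports. This is a minimal sketch and not the mechanism used in the talk, which detects changes per Triples Map through logged hashes; the function name `diff_triples` and the `ex:` identifiers are hypothetical.

```python
def diff_triples(previous, current):
    """Return (to_insert, to_delete) needed to turn `previous` into `current`."""
    previous, current = set(previous), set(current)
    return current - previous, previous - current


# Hypothetical exports: the title of ex:item2 changed between two runs.
old_export = {
    ("ex:item1", "dc:title", '"A"'),
    ("ex:item2", "dc:title", '"B"'),
}
new_export = {
    ("ex:item1", "dc:title", '"A"'),
    ("ex:item2", "dc:title", '"B2"'),
}

# An update shows up as one deletion plus one insertion.
to_insert, to_delete = diff_triples(old_export, new_export)
```

The unchanged triple for `ex:item1` appears in neither set, so it would not be rewritten.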

Incremental Export into RDF (2/2)
- Incremental transformation
  - Each time the transformation is executed, only the information that changed is transformed into RDF, not all of the initial information that lies in the database
- Incremental storage
  - Storing (persisting) to the destination RDF graph only the triples that were modified, not the whole graph
  - Possible only when the resulting RDF graph is stored in a relational database or in Jena TDB
  - Applies regardless of whether the transformation took place fully or incrementally

R2RML and Triples Maps
- R2RML: RDB to RDF Mapping Language
  - A W3C Recommendation, as of 2012
- Triples Map: a reusable mapping definition
  - Specifies a rule for translating each row of a logical table to zero or more RDF triples
  - A logical table is a tabular SQL query result set that is to be mapped to RDF triples
- Execution of a triples map generates the triples that originate from a specific result set (logical table)
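
To make "each row of a logical table to zero or more RDF triples" concrete, the sketch below applies a much simplified stand-in for a triples map to a list of rows. It is an illustration only and does not reproduce R2RML syntax or the R2RML Parser's code; `apply_triples_map`, the subject template, and the `example.org` URIs are assumptions.

```python
def apply_triples_map(rows, subject_template, predicate_columns):
    """Translate each row into zero or more (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = subject_template.format(**row)
        for predicate, column in predicate_columns.items():
            value = row.get(column)
            if value is None:
                # A NULL column yields no triple for this row.
                continue
            triples.append((subject, predicate, str(value)))
    return triples


# Hypothetical logical table (the result set of a SELECT query).
rows = [
    {"item_id": 1, "text_value": "Athens"},
    {"item_id": 2, "text_value": None},
]

triples = apply_triples_map(
    rows,
    subject_template="http://example.org/item/{item_id}",
    predicate_columns={"dcterms:coverage": "text_value"},
)
```

The first row produces one triple and the second row produces none, which is the "zero or more" case from the slide.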

The R2RML Parser
- An R2RML implementation
- Command-line tool that can export relational database contents as RDF graphs, based on an R2RML mapping document
- Open-source (CC BY-NC), written in Java
  - Publicly available at [URL not preserved]
  - Tested against MySQL and PostgreSQL
- Output can be written as RDF/OWL in N3, Turtle (TTL), N-Triples, or RDF/XML notation, or to a Jena TDB backend
- Covers most (not all) of the R2RML constructs (see the wiki)
- Does not offer SPARQL-to-SQL translations

Information Flow
- Parse the source database contents into result sets
- According to the R2RML mapping file, the Parser generates a set of instructions for the Generator
- The Generator instantiates the resulting RDF graph in memory
- Persist the generated RDF graph into
  - An RDF file on the hard disk, or
  - Jena's relational database, or
  - Jena's TDB (Tuple Data Base, a custom implementation of B+ trees)
- Log the results
[Figure: R2RML Parser components: Mapping file, Source database, Parser, Generator, RDF graph, Hard Disk, Target database, TDB]

Incremental RDF Triple Generation
- Basic challenge
  - Discover, since the last time the incremental RDF generation took place
    - Which database tuples were modified
    - Which Triples Maps were modified
  - Then, perform the mapping only for this altered subset
- Ideally, we should detect the exact database cells that changed and modify only the corresponding generated elements in the RDF graph
  - However, using R2RML, the atom of the mapping definition becomes the Triples Map
[Figure: Result set (columns a to e), Triples Map, Triples (s, p, o)]

Keeping Track
- Reification
  - Allows assertions about RDF statements
  - "Reified" model
    - A model that contains only reified statements
    - Stores the Triples Map URI that produced each triple
- Logging
  - Store MD5 hashes of Triples Maps, SELECT queries, and result sets
  - A change in any of the hashes triggers execution of the Triples Map
[Figure: a triple (subject, property, object) and its reified form, a blank node with rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object, and dc:source pointing to the Triples Map URI]
Example:
    foaf:name "John Smith".
    [] a rdf:Statement ;
       rdf:subject ;
       rdf:predicate foaf:name ;
       rdf:object "John Smith" ;
       dc:source map:persons.
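
The reified model described on this slide can be sketched with plain tuples: each original triple becomes an `rdf:Statement` node that also carries a `dc:source` link to the Triples Map that produced it. This is a hedged illustration rather than the tool's Jena-based implementation; the helper names `reify` and `statements_from`, the blank node labels, and the `ex:` subjects are hypothetical.

```python
import itertools

_ids = itertools.count()  # generates labels for the blank nodes


def reify(triple, triples_map_uri):
    """Return the reified form of `triple`, tagged with its source Triples Map."""
    subject, predicate, obj = triple
    node = f"_:stmt{next(_ids)}"
    return [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", subject),
        (node, "rdf:predicate", predicate),
        (node, "rdf:object", obj),
        (node, "dc:source", triples_map_uri),
    ]


def statements_from(reified_model, triples_map_uri):
    """Recover the original triples that a given Triples Map produced."""
    nodes = {s for s, p, o in reified_model
             if p == "dc:source" and o == triples_map_uri}
    parts = {}
    for s, p, o in reified_model:
        if s in nodes and p in ("rdf:subject", "rdf:predicate", "rdf:object"):
            parts.setdefault(s, {})[p] = o
    return [(d["rdf:subject"], d["rdf:predicate"], d["rdf:object"])
            for d in parts.values()]


model = (
    reify(("ex:person1", "foaf:name", '"John Smith"'), "map:persons")
    + reify(("ex:item1", "dc:title", '"A title"'), "map:items")
)
from_persons = statements_from(model, "map:persons")
```

Because every statement records its source, the triples of one Triples Map can be located and replaced without touching the rest of the model.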

Proposed Approach
- For each Triples Map in the mapping document
  - Decide whether we have to produce the resulting triples, based on the logged MD5 hashes
- Dumping to the hard disk
  - Initially, generate a number of RDF triples
  - RDF triples are logged as reified statements, followed by a provenance note
  - Incremental generation: in subsequent executions, modify the existing reified model by reflecting only the changes in the source database
- Dumping to the database or TDB
  - No log is needed; storage is incremental by default
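
The per-Triples-Map decision can be sketched as a comparison of MD5 fingerprints against a log from the previous run. This is a minimal sketch under stated assumptions: the log is a plain dictionary, result sets are hashed through `repr`, and the names `fingerprint` and `maps_to_run` are hypothetical and not taken from the R2RML Parser.

```python
import hashlib


def md5(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def fingerprint(map_definition, select_query, result_rows):
    """Hash the three things whose change should trigger re-execution."""
    return {
        "map": md5(map_definition),
        "query": md5(select_query),
        "result": md5(repr(result_rows)),
    }


def maps_to_run(current, logged):
    """Names of Triples Maps whose fingerprint is new or differs from the log."""
    return [name for name, fp in current.items() if logged.get(name) != fp]


# Hypothetical state logged by the previous export.
logged = {
    "persons": fingerprint("map:persons", "SELECT * FROM eperson", [(1, "John Smith")]),
    "items": fingerprint("map:items", "SELECT * FROM item", [(1,)]),
}
# Current state: a new row appeared in the item table.
current = {
    "persons": fingerprint("map:persons", "SELECT * FROM eperson", [(1, "John Smith")]),
    "items": fingerprint("map:items", "SELECT * FROM item", [(1,), (2,)]),
}

changed = maps_to_run(current, logged)
```

Only the `items` map is re-executed here; the triples produced by `persons` are left as they are.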

Measurements Setup
- An Ubuntu server, 2 GHz dual-core, 4 GB RAM
- Oracle Java 1.7, PostgreSQL 9.1, MySQL
- DSpace (dspace.org) repositories
  - 1k, 5k, 10k, 50k, 100k, 500k, 1m items, respectively
- A set of complicated, a set of simplified, and a set of simple queries
- To deal with database caching effects, the queries were run several times before performing the measurements

Query Sets
- Complicated
  - 3 expensive JOIN conditions among 4 tables
  - 4 WHERE clauses
- Simplified
  - 2 JOIN conditions among 3 tables
  - 2 WHERE clauses
- Simple
  - No JOIN or WHERE conditions

Complicated query:
    SELECT i.item_id AS item_id, mv.text_value AS text_value
    FROM item AS i, metadatavalue AS mv,
         metadataschemaregistry AS msr, metadatafieldregistry AS mfr
    WHERE msr.metadata_schema_id=mfr.metadata_schema_id
      AND mfr.metadata_field_id=mv.metadata_field_id
      AND mv.text_value IS NOT NULL
      AND i.item_id=mv.item_id
      AND msr.namespace=' terms/'
      AND mfr.element='coverage'
      AND mfr.qualifier='spatial'

Simplified query:
    SELECT i.item_id AS item_id, mv.text_value AS text_value
    FROM item AS i, metadatavalue AS mv, metadatafieldregistry AS mfr
    WHERE mfr.metadata_field_id=mv.metadata_field_id
      AND i.item_id=mv.item_id
      AND mfr.element='coverage'
      AND mfr.qualifier='spatial'

Simple query:
    SELECT "language", "netid", "phone", "sub_frequency", "last_active",
           "self_registered", "require_certificate", "can_log_in",
           "lastname", "firstname", "digest_algorithm", "salt",
           "password", " ", "eperson_id"
    FROM "eperson"
    ORDER BY "language"

Measurements (1/3)
- Export to the hard disk
- Simple and complicated queries, initial export
- Initial incremental dumps take more time than non-incremental ones, as the reified model also has to be created

Measurements (2/3)
- 12 Triples Maps
  - a. Non-incremental mapping transformation
  - b. Incremental, for the initial time
  - Data change: c. 0/12, d. 1/12, e. 3/12, f. 6/12, g. 9/12, h. 12/12

Measurements (3/3)
- Similar overall behavior in cases of up to 3 million triples
- Poor relational database performance
- Jena TDB is the optimal approach regarding scalability
[Figure panels: ~180k triples, ~1.8m triples]

Discussion (1/2)
- The approach is efficient when data freshness is not crucial and/or selection queries over the contents are more frequent than updates
- The task of exposing database contents as RDF could be considered similar to the task of maintaining search indexes next to text content
- Third-party software systems can operate completely based on the exported graph
  - E.g. using Fuseki, Sesame, Virtuoso
- Updates can be pushed or pulled from the database
- TDB is the optimal solution regarding scaling
- Caution is still needed in producing dereferenceable URIs

Discussion (2/2)
- On the efficiency of the approach for storing on the hard disk
  - Good results for mappings (or queries) that include (or lead to) expensive SQL queries
    - E.g. with numerous JOIN statements
  - For changes that can affect as much as ¾ of the source data
- Limitations
  - By physical memory
    - Scales up to several million triples; does not qualify as "Big Data"
  - Formatting of the logged reified model did affect performance
    - RDF/XML and TTL try to pretty-print the result, consuming extra resources; N-Triples is the optimal choice

Room for Improvement
- Hashing result sets is expensive
  - Requires re-running the query and adds an "expensive" ORDER BY clause
- Further study of the impact of SQL complexity on performance
- Reification is currently being reconsidered in RDF 1.1 semantics
  - Named graphs being the successor
- Investigation of two-way updates
  - Send changes from the triplestore back to the database
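
One plausible reading of why result set hashing needs an ORDER BY clause is that a hash over the rows depends on the order in which they arrive. The sketch below illustrates this with in-memory rows, where Python's `sorted` stands in for ORDER BY; the helper `hash_rows` and the sample rows are hypothetical.

```python
import hashlib


def hash_rows(rows):
    """MD5 over the rows in the order they are received."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


# The same two rows, returned in a different order on two runs.
first_run = [(1, "Athens"), (2, "Berlin")]
second_run = [(2, "Berlin"), (1, "Athens")]

# Without a fixed order the hashes differ although the data is unchanged.
unordered_equal = hash_rows(first_run) == hash_rows(second_run)

# With a fixed order the hashes agree.
ordered_equal = hash_rows(sorted(first_run)) == hash_rows(sorted(second_run))
```

Without a fixed order, an unchanged table could look modified and trigger an unnecessary re-execution of its Triples Map, so the sorting cost is the price of a stable hash.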

Thank you for your attention! Questions?