An Approach for the Incremental Export of Relational Databases into RDF Graphs N. Konstantinou, D.-E. Spanos, D. Kouis, N. Mitrou National Technical University.

Slides:



Advertisements
Similar presentations
Tuning: overview Rewrite SQL (Leccotech)Leccotech Create Index Redefine Main memory structures (SGA in Oracle) Change the Block Size Materialized Views,
Advertisements

Semantic Web Introduction
RDF Databases By: Chris Halaschek. Outline Motivation / Requirements Storage Issues Sesame General Introduction Architecture Scalability RQL Introduction.
Triple Stores
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
Chapter 14 The Second Component: The Database.
Chapter 17 Methodology – Physical Database Design for Relational Databases Transparencies © Pearson Education Limited 1995, 2005.
ONTOLOGY ENGINEERING Lab #9 - November 3, Linking Relational Databases to Ontologies 2  Relational databases are still a common means of storing.
Introduction To Databases IDIA 618 Fall 2014 Bridget M. Blodgett.
Conceptual Architecture of PostgreSQL PopSQL Andrew Heard, Daniel Basilio, Eril Berkok, Julia Canella, Mark Fischer, Misiu Godfrey.
DAY 21: MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Akhila Kondai October 30, 2013.
Triple Stores.
Objectives of the Lecture :
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Berlin SPARQL Benchmark (BSBM) Presented by: Nikhil Rajguru Christian Bizer and Andreas Schultz.
RDF Triple Stores Nipun Bhatia Department of Computer Science. Stanford University.
Rajashree Deka Tetherless World Constellation Rensselaer Polytechnic Institute.
Incremental Export of Relational Database Contents into RDF Graphs Nikolaos Konstantinou, Dimitris Kouis, Nikolas Mitrou By Dr. Nikolaos Konstantinou National.
Chapter. 8: Indexing and Searching Sections: 8.1 Introduction, 8.2 Inverted Files 9/13/ Dr. Almetwally Mostafa.
Lecture 9 Methodology – Physical Database Design for Relational Databases.
Chapter 3 Querying RDF stores with SPARQL. Why an RDF Query Language? Why not use an XML query language? XML at a lower level of abstraction than RDF.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Chapter 7: Database Systems Succeeding with Technology: Second Edition.
Database Management. ICT5 Database Administration (DBA) The DBA’s tasks will include the following: 1. The design of the database. After the initial design,
Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu and Bhavani Thuraisingham.
MySQL. Dept. of Computing Science, University of Aberdeen2 In this lecture you will learn The main subsystems in MySQL architecture The different storage.
KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association Institute of Applied Informatics.
Relational Databases to RDF (a.k.a RDB2RDF) Juan F. Sequeda Dept of Computer Science University of Texas at Austin.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, Bhavani Thuraisingham University.
DBXplorer: A System for Keyword- Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das Cathy Wang
PowerPoint Presentation for Dennis, Wixom, & Tegarden Systems Analysis and Design with UML, 4th Edition Copyright © 2009 John Wiley & Sons, Inc. All rights.
Relational Database CISC/QCSE 810 some materials from Software Carpentry.
NMED 3850 A Advanced Online Design January 12, 2010 V. Mahadevan.
1 CS 430 Database Theory Winter 2005 Lecture 17: Objects, XML, and DBMSs.
On the Semantics of R2RML and its Relationship with the Direct Mapping Juan F. Sequeda Research in Bioinformatics and Semantic Web (RiBS) Lab Department.
DEPICT: DiscovEring Patterns and InteraCTions in databases A tool for testing data-intensive systems.
Introduction n How to retrieval information? n A simple alternative is to search the whole text sequentially n Another option is to build data structures.
Views In some cases, it is not desirable for all users to see the entire logical model (that is, all the actual relations stored in the database.) In some.
RecBench: Benchmarks for Evaluating Performance of Recommender System Architectures Justin Levandoski Michael D. Ekstrand Michael J. Ludwig Ahmed Eldawy.
STASIS Technical Innovations - Simplifying e-Business Collaboration by providing a Semantic Mapping Platform - Dr. Sven Abels - TIE -
Large-scale Linked Data Management Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman Big Linked Data Tutorial Semantic Days 2012.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
Methodology – Physical Database Design for Relational Databases.
RDF languages and storages part 1 - expressivness Maciej Janik Conrad Ibanez CSCI 8350, Fall 2004.
Practical RDF Chapter 10. Querying RDF: RDF as Data Shelley Powers, O’Reilly SNU IDB Lab. Hyewon Lim.
JAVA BEANS JSP - Standard Tag Library (JSTL) JAVA Enterprise Edition.
Triple Stores. What is a triple store? A specialized database for RDF triples Can ingest RDF in a variety of formats Supports a query language – SPARQL.
Dec. 13, 2002 WISE2002 Processing XML View Queries Including User-defined Foreign Functions on Relational Databases Yoshiharu Ishikawa Jun Kawada Hiroyuki.
Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉 教授 : 許毅然 作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.
Session 1 Module 1: Introduction to Data Integrity
Software Reuse Course: # The Johns-Hopkins University Montgomery County Campus Fall 2000 Session 4 Lecture # 3 - September 28, 2004.
20 Copyright © 2008, Oracle. All rights reserved. Cache Management.
Steven Perry Dave Vieglais. W a s a b i Web Applications for the Semantic Architecture of Biodiversity Informatics Overview WASABI is a framework for.
PowerPoint Presentation for Dennis, Wixom, & Tegarden Systems Analysis and Design with UML, 5th Edition Copyright © 2015 John Wiley & Sons, Inc. All rights.
Sesame A generic architecture for storing and querying RDF and RDFs Written by Jeen Broekstra, Arjohn Kampman Summarized by Gihyun Gong.
Retele de senzori Curs 2 - 1st edition UNIVERSITATEA „ TRANSILVANIA ” DIN BRAŞOV FACULTATEA DE INGINERIE ELECTRICĂ ŞI ŞTIINŢA CALCULATOARELOR.
MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Sravanthi Lakkimsety Mar 14,2016.
SQL Basics Review Reviewing what we’ve learned so far…….
Managing Large RDF Graphs Vaibhav Khadilkar Dr. Bhavani Thuraisingham Department of Computer Science, The University of Texas at Dallas December 2008.
Introduction to Database Programming with Python Gary Stewart
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
INLS 623– Database Systems II– File Structures, Indexing, and Hashing
Triple Stores.
RDF and RDB 1 Some slides adapted from a presentation by Ivan Herman at the Semantic Technology & Business Conference, 2012.
Nikolaos Konstantinou Nikos Houssos Anastasia Manta
Methodology – Physical Database Design for Relational Databases
Triple Stores.
Practical Database Design and Tuning
Triple Stores.
Triple Stores.
Presentation transcript:

An Approach for the Incremental Export of Relational Databases into RDF Graphs N. Konstantinou, D.-E. Spanos, D. Kouis, N. Mitrou National Technical University of Athens 13th Hellenic Data Management Symposium (HDMS'15)

About this work  Started as a project in 2012 in the National Documentation Centre, to offer Linked Data views over its contents  Evolved as a standards-compliant open-source tool  First results presented in IC-ININFO’13 and MTSR’13  Journal paper of the presentation of the software used was awarded an Outstanding Paper Award  Latest results presented in WIMS’14  Revised and extended version in a special issue in IJAIT (2015) 13th Hellenic Data Management Symposium (HDMS'15)2

Outline  Introduction  Background  Proposed Approach  Measurements  Conclusions 13th Hellenic Data Management Symposium (HDMS'15)3

Introduction  Information collection, maintenance and update is not always taking place directly at a triplestore, but at a RDBMS  It can be difficult to change established methodologies and systems  Especially in less frequently changing environments, e.g. libraries  Triplestores are often kept as an alternative content delivery channel  Newer technologies need to operate side-by-side to existing ones before migration 13th Hellenic Data Management Symposium (HDMS'15)4

Mapping Relational Data to RDF  Synchronous or Asynchronous RDF Views  Real-time SPARQL-to-SQL or Querying the RDF dump using SPARQL  Queries on the RDF dump are faster in certain conditions, compared to round-trips to the database  Difference in the performance more visible when SPARQL queries involve numerous triple patterns (which translate to expensive JOIN statements)  In this paper, we focus on the asynchronous approach  Exporting (dumping) relational database contents into an RDF graph 13th Hellenic Data Management Symposium (HDMS'15)5

Incremental Export into RDF (1/2)  Problem  Avoid dumping the whole database contents every time  In cases when few data change in the source database, it is not necessary to dump the entire database  Approach  Every time the RDF export is materialized  Detect the changes in the source database or the mapping definition  Insert/delete/update only the necessary triples, in order to reflect these changes in the resulting RDF graph 613th Hellenic Data Management Symposium (HDMS'15)

Incremental Export into RDF (2/2)  Incremental transformation  Each time the transformation is executed, only the part in the database that changed should be transformed into RDF  Incremental storage  Storing (persisting) to the destination RDF graph only the triples that were modified and not the whole graph  Possible only when the resulting RDF graph is stored in a relational database or using Jena TDB 713th Hellenic Data Management Symposium (HDMS'15)

Outline  Introduction  Background  Proposed Approach  Measurements  Conclusions 13th Hellenic Data Management Symposium (HDMS'15)8

Reification in RDF 13th Hellenic Data Management Symposium (HDMS'15)9 "John Smith" foaf:name "John Smith" blank node rdf:Statement foaf:name rdf:type rdf:subject rdf:predicate rdf:object foaf:name "John Smith". becomes [] a rdf:Statement ; rdf:subject ; rdf:predicate foaf:name ; rdf:object "John Smith" ; dc:source map:persons.

Reification in RDF 13th Hellenic Data Management Symposium (HDMS'15)10 "John Smith" foaf:name "John Smith" blank node rdf:Statement map:persons foaf:name rdf:type rdf:subject rdf:predicate rdf:object dc:source foaf:name "John Smith". becomes [] a rdf:Statement ; rdf:subject ; rdf:predicate foaf:name ; rdf:object "John Smith" ; dc:source map:persons.  Ability to annotate every triple  E.g. the mapping definition that produced it

R2RML  RDB to RDF Mapping Language  A W3C Recommendation, as of 2012  Mapping documents contain sets of Triples Maps 13th Hellenic Data Management Symposium (HDMS'15)11 RefObjectMap TriplesMap PredicateObjectMap SubjectMap Generated Triples Generated Output Dataset GraphMap Join ObjectMap PredicateMap LogicalTable

Triples Maps in R2RML (1)  Reusable mapping definitions  Specify a rule for translating each row of a logical table to zero or more RDF triples  A logical table is a tabular SQL query result set that is to be mapped to RDF triples  Execution of a triples map generates the triples that originate from the specific result set 1213th Hellenic Data Management Symposium (HDMS'15)

Triples Maps in R2RML (2)  An example 13th Hellenic Data Management Symposium (HDMS'15) map:persons rr:logicalTable [ rr:tableName '"eperson"'; ]; rr:subjectMap [ rr:template ' rr:class foaf:Person; ]; rr:predicateObjectMap [ rr:predicate foaf:name; rr:objectMap [ rr:template '{"firstname"} {"lastname"}' ; rr:termType rr:Literal; ] ]. 13

An R2RML Mapping dcterms:. map:persons-groups rr:logicalTable [ rr:tableName '"epersongroup2eperson"'; ]; rr:subjectMap [ rr:template ' ]; rr:predicateObjectMap [ rr:predicate foaf:member; rr:objectMap [ rr:template ' rr:termType rr:IRI; ] ]. 14 foaf:member,. Table epersongroup2eperson

Outline  Introduction  Background  Proposed Approach  Measurements  Conclusions 13th Hellenic Data Management Symposium (HDMS'15)15

The R2RML Parser tool  An R2RML implementation  Command-line tool that can export relational database contents as RDF graphs, based on an R2RML mapping document  Open-source (CC BY-NC), written in Java  Publicly available at  Worldwide interest (Ontotext, Abbvie, Financial Times)  Tested against MySQL, PostgreSQL, and Oracle  Output can be written in RDF/OWL  N3, Turtle, N-Triple, TTL, RDF/XML(-ABBREV) notation, or Jena TDB backend  Covers most (not all) of the R2RML constructs (see the wiki)wiki  Does not offer SPARQL-to-SQL translations 16

Information Flow (1)  Parse the source database contents into result sets  According to the R2RML Mapping File, the Parser generates a set of instructions to the Generator  The Generator instantiates in-memory the resulting RDF graph  Persist the generated RDF graph into  An RDF file in the Hard Disk, or  In Jena’s relational database (eventually rendered obsolete), or  In Jena’s TDB (Tuple Data Base, a custom implementation of B+ trees)  Log the results Parser Generator Mapping file Source database RDF graph TDB R2RML Parser 17 Target database Hard Disk

Information Flow (2)  Overall generation time is the sum of the following:  t1: Parse mapping document  t2: Generate Jena model in memory  t3: Dump model to the destination medium  t4: Log the results  In incremental transformation, the log file contains the reified model  A model that contains only reified statements  Statements are annotated with the Triples Map URI that produced them 1813th Hellenic Data Management Symposium (HDMS'15)

Incremental RDF Triple Transformation  Basic challenge  Discover, since the last time the incremental RDF generation took place  Which database tuples were modified  Which Triples Maps were modified  Then, perform the mapping only for this altered subset  Ideally, we should detect the exact changed database cells and modify only the respectively generated elements in the RDF graph  However, using R2RML, the atom of the mapping definition becomes the Triples Map abcde spO Triples Map Result set Triples 11 13th Hellenic Data Management Symposium (HDMS'15)

Incremental transformation  Possible when the resulting RDF graph is persisted on the hard disk  The algorithm does not run the entire set of triples maps  Consult the log file with the output of the last run of the algorithm  MD5 hashes of triples maps definitions, SELECT queries, and respective query resultsets  Perform transformations only on the changed data subset  I.e. triples maps for which a change was detected  The resulting RDF graph file is erased and rewritten on the hard disk  Retrieve unchanged triples from the log file  Log file contains a set of reified statements, annotated as per source Triples Maps definition 13th Hellenic Data Management Symposium (HDMS'15)20

Incremental storage  Store changes without rewriting the whole graph  Possible when the resulting graph is persisted in an RDF store  Jena’s TDB in our case  The output medium must allow additions/deletions/modifications at the triples level 13th Hellenic Data Management Symposium (HDMS'15)21

Proposed Approach  For each Triples Map in the Mapping Document  Decide whether we have to produce the resulting triples, based on the logged MD5 hashes  Dumping to the Hard Disk  Initially, generate the number of RDF triples that correspond to the source database  RDF triples are logged and annotated as reified statements  Incremental generation  In subsequent executions, modify the existing reified model, by reflecting only the changes in the source database  Dumping to a database or to TDB  No log is needed, storage is incremental by default 22

Outline  Introduction  Background  Proposed Approach  Measurements  Conclusions 13th Hellenic Data Management Symposium (HDMS'15)23

Measurements Setup  An Ubuntu server, 2GHz dual-core, 4GB RAM  Oracle Java 1.7, Postgresql 9.1, Mysql  7 DSpace (dspace.org) repositories  1k, 5k, 10k, 50k, 100k, 500k, 1m items, respectively  Random data text values (2-50 chars) populating a random number (5-30) of Dublin Core metadata fields  A set of SQL queries: complicated, simplified, and simple  In order to deal with database caching effects, the queries were run several times, prior to performing the measurements 2413th Hellenic Data Management Symposium (HDMS'15)

Query Sets  Complicated  3 expensive JOIN conditions among 4 tables  4 WHERE clauses  Simplified  2 JOIN conditions among 3 tables  2 WHERE clauses  Simple  No JOIN or WHERE conditions SELECT i.item_id AS item_id, mv.text_value AS text_value FROM item AS i, metadatavalue AS mv, metadataschemaregistry AS msr, metadatafieldregistry AS mfr WHERE msr.metadata_schema_id=mfr.metadata_schema_id AND mfr.metadata_field_id=mv.metadata_field_id AND mv.text_value is not null AND i.item_id=mv.item_id AND msr.namespace=' terms/' AND mfr.element='coverage' AND mfr.qualifier='spatial' SELECT i.item_id AS item_id, mv.text_value AS text_value FROM item AS i, metadatavalue AS mv, metadatafieldregistry AS mfr WHERE mfr.metadata_field_id=mv.metadata_field_id AND i.item_id=mv.item_id AND mfr.element='coverage' AND mfr.qualifier='spatial' SELECT "language", "netid", "phone", "sub_frequency","last_active", "self_registered", "require_certificate", "can_log_in", "lastname", "firstname", "digest_algorithm", "salt", "password", " ", "eperson_id" FROM "eperson" ORDER BY "language" Q1: * Q2: * Q3: * * Score obtained using PostgreSQL’s EXPLAIN

Measurements Results  Exporting to an RDF File  Exporting to a Relational Database  Exporting to Jena TDB 13th Hellenic Data Management Symposium (HDMS'15)26

Exporting to an RDF File (1)  Export to an RDF file  Simple and complicated queries, initial export  Initial incremental dumps take more time than non-incremental, as the reified model also has to be created 2713th Hellenic Data Management Symposium (HDMS'15)

nnon-incremental mapping transformation aincremental, for the initial time b0/12 (no changes) c1/12 d3/12 e6/12 f9/12 g12/12 Exporting to an RDF File (2)  12 Triples Maps 28 Data change

Exporting to a Database and to Jena TDB  Jena TDB is the optimal approach regarding scalability 29 ~180k triples ~1.8m triples 13th Hellenic Data Management Symposium (HDMS'15)

Outline  Introduction  Background  Proposed Approach  Measurements  Conclusions 13th Hellenic Data Management Symposium (HDMS'15)30

Conclusions (1)  The approach is efficient when data freshness is not crucial and/or selection queries over the contents are more frequent than the updates  The task of exposing database contents as RDF could be considered similar to the task of maintaining search indexes next to text content  Third party software systems can operate completely based on the exported graph  E.g. using Fuseki, Sesame, Virtuoso  TDB is the optimal solution regarding scalability  Caution is still needed in producing de-referenceable URIs 31

Conclusions (2)  On the efficiency of the approach for storing RDF on the Hard Disk  Good results for mappings (or queries) that include (or lead to) expensive SQL queries  E.g. with numerous JOIN statements  For changes that can affect as much as ¾ of the source data  Limitations  By physical memory  Scales up to several millions of triples, does not qualify as “Big Data”  Formatting of the logged model did affect performance  RDF/XML and TTL try to pretty-print the result, consuming extra resources  N-TRIPLES is optimal 3213th Hellenic Data Management Symposium (HDMS'15)

Future Work  Hashing Result sets is expensive  Requires re-run of the query, adds an “expensive” ORDER BY clause  Further study the impact of SQL complexity on the performance  Investigation of two-way updates  Send changes from the triplestore back to the database 3313th Hellenic Data Management Symposium (HDMS'15)

Questions? 3413th Hellenic Data Management Symposium (HDMS'15)