Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration? Alasdair J G Gray (University of Glasgow, now Manchester), Norman Gray.

Presentation transcript:

Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?
Alasdair J G Gray (University of Glasgow, now Manchester)
Norman Gray (Universities of Leicester and Glasgow)
Iadh Ounis (University of Glasgow)
ESWC 2009 – Crete, 3 June 2009

Outline
Motivation: The Virtual Observatory
Can SPARQL be used to express scientific queries?
Can existing archives be exposed with semantic tools?
  – Can RDB2RDF tools extract large volumes of data?
3 June 2009, A.J.G. Gray - ESWC 2009

International Virtual Observatory Alliance
“facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”

Searching for Brown Dwarfs
Data sets:
  – Near Infrared, 2MASS/UK Infrared Deep Sky Survey
  – Optical, APMCAT/Sloan Digital Sky Survey
Complex colour/motion selection criteria
Similar problems
  – Halo White Dwarfs

Deep Field Surveys
Observations in multiple wavelengths
  – Radio to X-Ray
Searching for new objects
  – Galaxies, stars, etc.
Requires correlations across many catalogues
  – ISO
  – Hubble
  – SCUBA
  – etc.

The Problem
Locate and combine relevant data
Heterogeneous publishers
  – Archive centres
  – Research labs
Heterogeneous data
  – Relational
  – XML
  – Files
Virtual Observatory

A Data Integration Approach
Heterogeneous sources
  – Autonomous
  – Local schemas
Homogeneous view
  – Mediated global schema
Mapping
  – LAV: local-as-view
  – GAV: global-as-view
Relies on agreement of a common global schema
[Diagram: Query 1 … Query n are posed against the Global Schema; Mappings link it to Wrapper 1 … Wrapper k over DB 1 … DB k]

P2P Data Integration Approach
Heterogeneous sources
  – Autonomous
  – Local schemas
Heterogeneous views
  – Multiple schemas
Mappings
  – From sources to common schema
  – Between pairs of schemas
Requires a common integration data model
Can RDF do this?
[Diagram: Query 1 … Query n are posed against Schema 1 … Schema j; Mappings link the schemas to Wrapper 1 … Wrapper k over DB 1 … DB k]

Resource Description Framework
W3C standard
Designed as a metadata data model
Contains semantic details
Ideal for linking distributed data
Queried through SPARQL
[Diagram: example RDF graph with the following edges]
  #Sun #name "The Sun"
  #Sun rdf:type IAU:Star
  #Sun #foundIn #MilkyWay
  #MilkyWay #name "Milky Way"
  #MilkyWay #name "The Galaxy"
  #MilkyWay rdf:type IAU:BarredSpiral
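
The example graph on this slide can be written down as a set of subject-predicate-object triples. The following is a minimal Python sketch, not an RDF library: IRIs and literals are plain strings, the edge directions are inferred from the slide's labels, and the `match` helper is a hypothetical stand-in for a single SPARQL triple pattern.

```python
# Triples inferred from the slide's example graph: (subject, predicate, object).
TRIPLES = {
    ("#Sun", "#name", "The Sun"),
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "#name", "Milky Way"),
    ("#MilkyWay", "#name", "The Galaxy"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly: SELECT ?name WHERE { <#MilkyWay> <#name> ?name }
galaxy_names = [o for _, _, o in match(TRIPLES, s="#MilkyWay", p="#name")]
print(galaxy_names)
```

Because a resource may carry several values for the same predicate, both names of the galaxy are returned without any schema change, which is part of what makes the model attractive for linking data from different publishers.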

SPARQL
Declarative query language
  – Select returned data
      Graph or tuples
      Attributes to return
  – Describe structure of desired results
  – Filter data
W3C standard
Syntactically similar to SQL
  – Should be easy for scientists to learn!

Integrating Using RDF
Data resources
  – Expose schema and data as RDF
  – Need a SPARQL endpoint
Allows multiple
  – Access models
  – Storage models
Easy to relate data from multiple sources
We will focus on exposing relational data sources
[Diagram: a SPARQL query is posed against the Common Model (RDF); Mappings link it to RDF / Relational Conversion over a Relational DB and RDF / XML Conversion over an XML DB]

RDB2RDF
Extract-Transform-Load
  Data replicated as RDF
    – Data can become stale
  Native SPARQL query support
    – Limited optimisation mechanisms
  Existing RDF stores
    Jena
    Sesame
Query-driven Conversion
  Data stored as relations
  Native SQL query support
    – Highly optimised access methods
  SPARQL queries must be translated
  Existing translation systems
    D2RQ
    SquirrelRDF
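
The contrast between the two approaches can be sketched in a few lines of Python using the standard `sqlite3` module. Everything here is an assumption made for illustration: the `star` table, its columns, the `star/<id>` subject naming, and both helper functions are hypothetical, and the sketch does not reflect how D2RQ, SquirrelRDF, Jena, or Sesame are implemented.

```python
import sqlite3

# Hypothetical relational source with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE star (id INTEGER PRIMARY KEY, ra REAL, decl REAL)")
conn.executemany("INSERT INTO star VALUES (?, ?, ?)",
                 [(1, 10.5, -3.2), (2, 200.1, 45.0)])


def etl_triples(conn):
    """Extract-transform-load: replicate every row as RDF-style triples."""
    triples = []
    for row_id, ra, decl in conn.execute("SELECT id, ra, decl FROM star ORDER BY id"):
        subject = f"star/{row_id}"
        triples.append((subject, "ra", ra))
        triples.append((subject, "decl", decl))
    return triples


def query_driven(conn, attribute, minimum):
    """Query-driven: translate one 'attribute >= minimum' request into SQL."""
    if attribute not in {"ra", "decl"}:  # column names cannot be bound as parameters
        raise ValueError(attribute)
    sql = f"SELECT id, {attribute} FROM star WHERE {attribute} >= ? ORDER BY id"
    return [(f"star/{i}", attribute, v) for i, v in conn.execute(sql, (minimum,))]


replicated = etl_triples(conn)                    # one-off copy of the data
conn.execute("INSERT INTO star VALUES (3, 150.0, 0.0)")
on_demand = query_driven(conn, "ra", 100.0)       # evaluated by the database

print(len(replicated), on_demand)
```

The replicated copy still holds only the four triples extracted before the insert, so it is stale, while the translated query sees the new row and lets the relational engine do the filtering.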

System Hypothesis
Is it viable to perform query-driven conversions to facilitate data access from a data model that a scientist is familiar with?
Can RDB2RDF tools feasibly expose large science archives for data integration?
[Diagram labels: Relational DB, RDB2RDF, XML DB, RDF / XML Conversion, Common Model (RDF), Mappings, SPARQL query]

Astronomical Test Data Set
SuperCOSMOS Science Archive (SSA)
  – Data extracted from scans of Schmidt plates
  – Stored in a relational database
  – About 4TB of data, detailing 6.4 billion objects
  – Fairly typical of astronomical data archives
Schema designed using 20 real queries
Personal version contains
  – Data for a specific region of the sky
  – About 0.1% of the data
  – About 500MB

Analysis of Test Data
Using personal version
  – About 500MB in size (similar size to related work)
Organised in 14 Relations
  – Number of attributes: 2 – relations with more than 20 attributes
  – Number of rows: 3 – 585,560
  – Two views
      Complex selection criteria in views
Makes this different from business cases and previous work!

Is SPARQL expressive enough?
Can the 20 sample queries be expressed in SPARQL?

Real Science Queries
Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5
SQL:
  SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI
  FROM ReliableStars
  WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80)
    AND (sCorMagR2-sCorMagI BETWEEN AND 0.64)
SPARQL:
  SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI
  WHERE {
    FILTER (?sCorMagB - ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
    FILTER (?sCorMagR2 - ?sCorMagI >= && ?sCorMagR2 - ?sCorMagI <= 0.64)
  }
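
The SPARQL on this slide replaces each SQL BETWEEN with a pair of comparisons inside a FILTER. That rewriting is mechanical, as the following Python sketch suggests; `between_to_filter` is a hypothetical helper written for this illustration, not part of any RDB2RDF tool, and it only builds the FILTER text.

```python
def between_to_filter(expr: str, low: str, high: str) -> str:
    """Expand `expr BETWEEN low AND high` into an equivalent SPARQL FILTER.

    The bounds are passed as strings so they appear exactly as written.
    """
    return f"FILTER ({expr} >= {low} && {expr} <= {high})"


# First colour cut of Query 5, using the bounds given on the slide.
colour_cut = between_to_filter("?sCorMagB - ?sCorMagR2", "0.05", "0.80")
print(colour_cut)
```

The expansion repeats the arithmetic expression once per bound, which is why range shorthands make the SPARQL form noticeably longer than the SQL it came from.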

Analysis of Test Queries
Query Feature                              Query Numbers
Arithmetic in body                         1-5, 7, 9, 12, 13,
Arithmetic in head                         7-9, 12, 13
Ordering                                   1-8, 10-17, 19, 20
Joins (including self-joins)               12-17, 19
Range functions (e.g. Between, ABS)        2, 3, 5, 8, 12, 13, 15,
Aggregate functions (including Group By)   7-9, 18
Math functions (e.g. power, log, root)     4, 9, 16
Trigonometry functions                     8, 12
Negated sub-query                          18, 20
Type casting (e.g. radians to degrees)     7, 8, 12
Server defined functions                   10, 11

Expressivity of SPARQL
Features
  Select-project-join
  Arithmetic in body
  Conjunction and disjunction
  Ordering
  String matching
  External function calls (extension mechanism)
Limitations
  Range shorthands
  Arithmetic in head
  Math functions
  Trigonometry functions
  Sub queries
  Aggregate functions
  Casting

Analysis of Test Queries
Query Feature                              Query Numbers
Arithmetic in body                         1-5, 7, 9, 12, 13,
Arithmetic in head                         7-9, 12, 13
Ordering                                   1-8, 10-17, 19, 20
Joins (including self-joins)               12-17, 19
Range functions (e.g. Between, ABS)        2, 3, 5, 8, 12, 13, 15,
Aggregate functions (including Group By)   7-9, 18
Math functions (e.g. power, log, root)     4, 9, 16
Trigonometry functions                     8, 12
Negated sub-query                          18, 20
Type casting (e.g. radians to degrees)     7, 8, 12
Server defined functions                   10, 11
Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19

Can RDB2RDF tools feasibly expose large science archives for data integration?

Experiment
Time query evaluation
  – 5 out of 20 queries used
  – No joins
Systems compared:
  – Relational DB (baseline)
      MySQL v
  – RDB2RDF tools
      D2RQ v0.5.2
      SquirrelRDF v0.1
  – RDF Triple stores
      Jena v2.5.6 (SDB)
      Sesame v2.1.3 (Native)
[Diagram: three configurations: a SPARQL query over RDB2RDF over a Relational DB; a SPARQL query over a Triple store; an SQL query over a Relational DB]

Experimental Configuration
8 identical machines
  – 64 bit Intel Quad Core Xeon 2.4GHz
  – 4GB RAM
  – 100 GB Hard drive
  – Java 1.6
  – Linux
10 repetitions
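
A repeated-timing measurement of this kind might look like the following Python sketch. It is only an illustration: an in-memory SQLite table stands in for the systems under test, the `obj` table and the `time_query` helper are hypothetical, and the slides do not say how the 10 repetitions were aggregated, so reporting the mean is an assumption.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, repetitions=10):
    """Run `sql` repeatedly; return per-run wall-clock times in milliseconds."""
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so evaluation completes
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


# Hypothetical stand-in data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL)")
conn.executemany("INSERT INTO obj VALUES (?, ?)",
                 [(i, i * 0.01) for i in range(1000)])

timings = time_query(conn, "SELECT id FROM obj WHERE mag >= 5.0 ORDER BY mag")
print(f"mean over {len(timings)} runs: {statistics.mean(timings):.3f} ms")
```

Keeping the individual timings rather than only a summary makes it possible to check for warm-up effects, such as a slow first run caused by cold caches.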

Performance Results
[Chart: query evaluation times in ms]

Conclusions
SPARQL not expressive enough for real astronomical queries
RDBMS benefits from 30+ years research
  – Query optimisation
  – Indexes
RDF stores are improving
  – Require existing data to be replicated
RDB2RDF tools show promise
  – Need to exploit relational database

Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?
Not currently!
We need more work on query translation…