Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester.

Slides:



Advertisements
Similar presentations
Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
Advertisements

Registries Work Package 2 Requirements, Science Cases, Use Cases, Test Cases Charter: Focus on science case scenarios, and use cases related specifically.
September 13, 2004NVO Summer School1 VO Protocols Overview Tom McGlynn NASA/GSFC T HE US N ATIONAL V IRTUAL O BSERVATORY.
Controlled Vocabularies in TELPlus Antoine ISAAC Vrije Universiteit Amsterdam EDLProject Workshop November 2007.
Schema Matching and Query Rewriting in Ontology-based Data Integration Zdeňka Linková ICS AS CR Advisor: Július Štuller.
Terrier Workshop: 24 th October 2007 Alasdair J G Gray.
Data Intensive Techniques to Boost the Real-time Performance of Global Agricultural Data Infrastructures SEMAGROW U SING A POWDER T RIPLE S TORE FOR BOOSTING.
Semantic Search Jiawei Rong Authors Semantic Search, in Proc. Of WWW Author R. Guhua (IBM) Rob McCool (Stanford University) Eric Miller.
Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.
ReQuest (Validating Semantic Searches) Norman Piedade de Noronha 16 th July, 2004.
A Registry for controlled vocabularies at the Library of Congress
Overview of Search Engines
Cloud based linked data platform for Structural Engineering Experiment Xiaohui Zhang
1 Semantic Data Management Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies.
Accessing Existing Distributed Science Archives as RDF Models Alasdair J G Gray 1 Norman Gray 2 Iadh Ounis 1 1 Computing Science, University of Glasgow.
AstroTag MUG meeting, STScI December Data Tagging Storing associations between data sets and tags (words/phrases) – IPPPSSOOT {w_1, w_2, …, w_n}
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
Some facets of knowledge management in mathematics Wolfram Sperber (Zentralblatt Math) Patrick Ion (Math Reviews) Facets of Knowledge Organization A tribute.
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Publishing data on the Web (with.
Berlin SPARQL Benchmark (BSBM) Presented by: Nikhil Rajguru Christian Bizer and Andreas Schultz.
PREMIS Tools and Services Rebecca Guenther Network Development & MARC Standards Office, Library of Congress NDIIPP Partners Meeting July 21,
S. Derriere et al., ESSW03 Budapest, 2003 May 20 UCDs - metadata for astronomy Sébastien Derriere François Ochsenbein Thomas Boch CDS, Observatoire astronomique.
Information Extraction with Linked Life Data 19/04/2011.
BiodiversityWorld GRID Workshop NeSC, Edinburgh – 30 June and 1 July 2005 Metadata Agents and Semantic Mediation Mikhaila Burgess Cardiff University.
Terminology services and the DDC: the High-Level Thesaurus and beyond Presented to the symposium Dewey goes Europe: on the use and development of the Dewey.
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?  Alasdair J G Gray (University of Glasgow now Manchester)  Norman Gray.
The Semantic Web Service Shuying Wang Outline Semantic Web vision Core technologies XML, RDF, Ontology, Agent… Web services DAML-S.
Functions and Demo of Astrogrid 1.1 China-VO Haijun Tian.
1 © 2012 OpenLink Software, All rights reserved. Virtuoso - Column Store, Adaptive Techniques for RDF Orri Erling Program Manager, Virtuoso Openlink Software.
Vocabularies in the VO Alasdair J G Gray Norman Gray Iadh Ounis.
1 Foundations V: Infrastructure and Architecture, Middleware Deborah McGuinness TA Weijing Chen Semantic eScience Week 10, November 7, 2011.
ICS-FORTH January 11, Thesaurus Mapping Martin Doerr Foundation for Research and Technology - Hellas Institute of Computer Science Bath, UK, January.
Scalable Metadata Definition Frameworks Raymond Plante NCSA/NVO Toward an International Virtual Observatory How do we encourage a smooth evolution of metadata.
NEON Obs School 11-Aug-2005 Archival Data and Virtual Observatories 1 Virtual Observatories...or how to do your research from a beach in the Bahamas rather.
Metadata and Geographical Information Systems Adrian Moss KINDS project, Manchester Metropolitan University, UK
Incorporating ARGOVOC in DSpace-based Agricultural Repositories Dr. Devika P. Madalli & Nabonita Guha Documentation Research & Training Centre Indian Statistical.
Metadata. Generally speaking, metadata are data and information that describe and model data and information For example, a database schema is the metadata.
The Explicator Project: Integrating Astronomy Data with Semantic Web Tools Alasdair J G Gray Information Management Group Seminar University of Manchester.
Exploitation of Dynamic Information Relations in the Service-Oriented AFRL Information Management Systems Andrzej Uszok, Larry Bunch, Jeffrey M. Bradshaw.
Semantic Access to Existing Archives Using RDF and SPARQL Alasdair J G Gray.
Coastal Atlas Interoperability - Ontologies (Advanced topics that we did not get to in detail) Luis Bermudez Stephanie Watson Marine Metadata Interoperability.
Federation and Fusion of astronomical information Daniel Egret & Françoise Genova, CDS, Strasbourg Standards and tools for the Virtual Observatories.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
It’s all semantics! The premises and promises of the semantic web. Tony Ross Centre for Digital Library Research, University of Strathclyde
SKOS. Ontologies Metadata –Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies –Provide.
The International Virtual Observatory Alliance (IVOA) interoperability in action.
Introduction to the Semantic Web and Linked Data Module 1 - Unit 2 The Semantic Web and Linked Data Concepts 1-1 Library of Congress BIBFRAME Pilot Training.
Introduction to the Semantic Web and Linked Data
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
1 Open Ontology Repository initiative - Planning Meeting - Thu Co-conveners: PeterYim, LeoObrst & MikeDean ref.:
Raluca Paiu1 Semantic Web Search By Raluca PAIU
Introduction to the VO ESAVO ESA/ESAC – Madrid, Spain.
Distributed Archives Interoperability Cynthia Y. Cheung NASA Goddard Space Flight Center IAU 2000 Commission 5 Manchester, UK August 12, 2000.
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
Linked Open Data for European Earth Observation Products Carlo Matteo Scalzo CTO, Epistematica epistematica.
The AstroGrid-D Information Service Stellaris A central grid component to store, manage and transform metadata - and connect to the VO!
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Information Retrieval in Practice
The Semantic Web By: Maulik Parikh.
Middleware independent Information Service
NKOS workshop Alicante, 2006
PREMIS Tools and Services
LOD reference architecture
The state of VOEvent semantics THE US NATIONAL VIRTUAL OBSERVATORY
Google Sky.
Presentation transcript:

Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester

Outline Motivation: Astronomy & The Virtual Observatory Data Integration Challenges 1.Locating the relevant data 2.Requesting the required data 3.Understanding the returned data 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar1 Using Semantic Web tools and technology to overcome these challenges

Context: Astronomy 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar2 Data collected across electromagnetic spectrum Analysed within one wavelength Data collection is – expensive – time consuming Existing data – large quantities – freely available Image: Wikipedia

Virtual Observatory “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.” 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar3

Searching for Brown Dwarfs Data sets: – Near Infrared, 2MASS/UK Infrared Deep Sky Survey – Optical, APMCAT/Sloan Digital Sky Survey Complex colour/motion selection criteria Similar problems – Halo White Dwarfs 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar4 Image: AstroGrid

Deep Field Surveys Observations in multiple wavelengths – Radio to X-Ray Searching for new objects – Galaxies, stars, etc Requires correlations across many catalogues – ISO – Hubble – SCUBA – etc 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar5

Locate, retrieve, and interpret relevant data Heterogeneous publishers – Archive centres – Research labs Heterogeneous data – Relational – XML – Image Files 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar6 Virtual Observatory Virtual Observatory: The Problems

Locate, retrieve, and interpret relevant data 1.Which data sources contain relevant data? 2.How do I query the relevant data sources? 3.How can I interpret/combine/analyse the data? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar7 Virtual Observatory

Finding relevant data sources 1.Which data sources contain relevant data? 23 June 20098A.J.G. Gray - Bolzano-Bozen Seminar

Which data sources do I use? VO registry – 65,000+ entries – Many mirrored services VOExplorer – Registry search tool Resources tagged with keywords 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar9 - 6df - survey - galaxy - galaxies - redshift - redshifts - 2mass - 6df - survey - galaxy - galaxies - redshift - redshifts - 2mass

Analysis of Registry Keywords Problems: – Plural/singular – Case – Abbreviations – Different tags – Specificity of tags Thanks to Sébastien Derriere for this data. 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar10 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars...

Analysis of Registry Keywords Problems: – Plural/singular – Case Solution: (standard IR techniques) – Stemming Star & Stars become Star – Case normalisation lowercase 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar11 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars...

Analysis of Registry Keywords Problems: – Abbreviations – Different tags – Specificity of tags Solution: Need to understand semantics! 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar12 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars...

Semantic Options Folksonomies – Keyword tags, freely chosen Vocabulary – Controlled list of words with definitions Taxonomy – Relationships: Broader/Narrower/Related Thesaurus – Synonyms, antonyms, see also Ontology – Formal specification of a shared conceptualisation – OWL “Vocabulary” used to cover vocabularies, taxonomies, and thesauri. 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar13 Image: Leonard Cohen Search Leonard Cohen Search

What is a Vocabulary? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar14

What is a Controlled Vocabulary? A set of terms with: – Label – Synonyms – Definition – Relationships to other terms: Broader term Narrower term Related term Example: – “Spiral galaxy” – “Spiral nebula” – “A galaxy having a spiral structure” – Relationships carrying semantic information: BT: “Galaxy” NT: “Barred spiral galaxy” RT: “Spiral arm” 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar15

Existing Vocabularies in Astronomy Journal Keywords – Developed for tagging papers – 311 terms – Actively used Astronomy Visualization Metadata (AVM) – Tagging images – 217 terms – Actively used IAU Thesaurus – Developed for libraries in 1993 – 2,551 terms – Never really used Unified Content Descriptor (UCD) – Tagging resource data – 473 terms – Actively used 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar16

Common Vocabulary Format Requirements: – Provide term identifiers Unambiguous tagging – Capture semantic relationships Poly-hierarchy structure – Machine processable Allows inter-operability “Machine intelligence” – Avoids problems of: Spelling Case Plurality problems Tags – Automated reasoning: Interested in all “Supernova” Items tagged as “1a Supernova” also returned 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar17

SKOS – W3C standard for sharing vocabularies – Based on RDF Semantic model for describing resources – Provides URI for each term – Captures properties of terms – Encodes relationships between terms Enables automated reasoning Standard serialisations “Looser” semantics than OWL – Adopted by IVOA as a standard for vocabularies 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar18

Example SKOS Vocabulary Term Example “Spiral galaxy” “Spiral nebula” “A galaxy having a spiral structure” Relationships: BT: “Galaxy” NT: “Barred spiral galaxy” RT: “Spiral arm” In turtle notation #spiralGalaxy a concept; prefLabel “Spiral altLabel “Spiral definition “A galaxy having a spiral broader #galaxy; narrower #barredSpiralGalaxy; related #spiralArm. 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar19

Inter-operable Vocabularies Which vocabulary should I use? One that you know! Closest match to your needs Vocabulary terms related using mappings – Part of the SKOS standard – One mapping file per pair of vocabularies Inter-vocabulary mappings Broad match: – more general term Narrow match: – more specific term Related match: – associated term Exact match: – equivalent term Close match: – similar but not equivalent term 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar20

Mapping Editor 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar21

Putting it all together Use vocabulary concepts for – Tagging (using URI) Resources in the registry VOEvent packets – Searching by vocabulary concept User keyword search converted to vocabulary URI Provides semantic advantages – Reasoning about terms Relationships (Intra-vocabulary) Mappings (Inter-vocabulary) Requires a mechanism to convert a string to a concept 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar22

Vocabulary Explorer Search and browse vocabularies – Configure Vocabularies Mappings Based on Information Retrieval techniques Matching mechanisms Ranking results 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar23

Search Results RunBB2BM25DFR- BM25 IFB2In- expB2 In- expC2 InL2PL2TF-IDF Initial Query Expansion Term weighting Term weighting Combined June 2009A.J.G. Gray - Bolzano-Bozen Seminar24 Evaluation over 59 queries nDCG evaluation model (distinguishes highly relevant/relevant/not relevant)

Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar25

Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar26

Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar27

Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar28

Finding the Right Term: Conclusions Vocabularies improve search – Remove ambiguity – Increase precision and recall – Enable Reasoning about relevance Faceted browsing Provided tools for working with vocabularies – Reliable search from keyword string to vocabulary term – Exploration of vocabularies – Mapping terms across vocabularies 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar29

Extracting relevant data 2.How do I query the relevant data sources? 23 June A.J.G. Gray - Bolzano-Bozen Seminar

Locate, retrieve, and interpret relevant data Heterogeneous publishers – Archive centres – Research labs Heterogeneous data – Relational – XML – Image Files 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar31 Virtual Observatory Virtual Observatory: The Problems

A Data Integration Approach Heterogeneous sources – Autonomous – Local schemas Homogeneous view – Mediated global schema Mapping – LAV: local-as-view – GAV: global-as-view 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar32 Global Schema Query 1 Query n DB 1 Wrapper 1 DB k Wrapper k DB i Wrapper i Mappings Relies on agreement of a common global schema

P2P Data Integration Approach Heterogeneous sources – Autonomous – Local schemas Heterogeneous views – Multiple schemas Mappings – From sources to common schema – Between pairs of schema Require common integration data model Can RDF do this? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar33 Schema 1 DB 1 Wrapper 1 DB k Wrapper k DB i Wrapper i Schema j Query 1 Query n Mappings

Resource Description Framework W3C standard Designed as a metadata data model Contains semantic details Ideal for linking distributed data Queried through SPARQL 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar34 #foundIn #Sun The Sun #name #MilkyWay Milky Way #name The Galaxy #name IAU:Star rdf:type IAU:BarredSpiral rdf:type

SPARQL Declarative query language – Select returned data (projection) Graph or tuples Attributes to return – Describe structure of desired results – Filter data (selection) W3C standard Syntactically similar to SQL – Should be easy for astronomers to learn! 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar35

Integrating Using RDF Data resources – Expose schema and data as RDF – Need a SPARQL endpoint Allows multiple – Access models – Storage models Easy to relate data from multiple sources 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar36 Relational DB RDF / Relational Conversion XML DB RDF / XML Conversion Common Model (RDF) Mappings SPARQL query We will focus on exposing relational data sources

RDB2RDF: Two Approaches Extract-Transform-Load Data replicated as RDF – Data can become stale Native SPARQL query support – Limited optimisation mechanisms Existing RDF stores Jena Seasame Query-driven Conversion Data stored as relations Native SQL query support – Highly optimised access methods SPARQL queries must be translated Existing translation systems D2RQ SquirrelRDF 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar37

System Test Hypothesis Is it viable to perform query- driven conversions to facilitate data access from a data model that an astronomer is familiar with? Can RDB2RDF tools feasibly expose large science archives for data integration? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar38 Relational DB RDB2RDF XML DB RDF / XML Conversion Common Model (RDF) Mappings SPARQL query SPARQL query

Astronomical Test Data Set SuperCOSMOS Science Archive (SSA) – Data extracted from scans of Schmidt plates – Stored in a relational database – About 4TB of data, detailing 6.4 billion objects – Fairly typical of astronomical data archives Schema designed using 20 real queries Personal version contains – Data for a specific region of the sky – About 0.1% of the data – About 500MB 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar39 Image: SuperCOSMOS Science Archive

Analysis of Test Data Using personal version – About 500MB in size (similar size to related work) Organised in 14 Relations – Number of attributes: 2 – relations with more than 20 attributes – Number of rows: 3 – 585,560 – Two views Complex selection criteria in views 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar40 Makes this different from business cases and previous work!

Is SPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar41

Real Science Queries Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5 SQL: SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI FROM ReliableStars WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN AND 0.64) SPARQL: SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI WHERE { … FILTER (?sCorMagB – ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80) FILTER (?sCorMagR2 – ?sCorMagI >= && ?sCorMagR2 - ?sCorMagI <= 0.64)} 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar42

Analysis of Test Queries Query FeatureQuery Numbers Arithmetic in body1-5, 7, 9, 12, 13, Arithmetic in head7-9, 12, 13 Ordering1-8, 10-17, 19, 20 Joins (including self-joins)12-17, 19 Range functions (e.g. Between, ABS)2, 3, 5, 8, 12, 13, 15, Aggregate functions (including Group By)7-9, 18 Math functions (e.g. power, log, root)4, 9, 16 Trigonometry functions8, 12 Negated sub-query18, 20 Type casting (e.g. Radians to degrees)7, 8, 12 Server defined functions10, June 2009A.J.G. Gray - Bolzano-Bozen Seminar43

Expressivity of SPARQL Features Select-project-join Arithmetic in body Conjunction and disjunction Ordering String matching External function calls (extension mechanism) Limitations Range shorthands Arithmetic in head Math functions Trigonometry functions Sub queries Aggregate functions Casting 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar44

Analysis of Test Queries Query FeatureQuery Numbers Arithmetic in body1-5, 7, 9, 12, 13, Arithmetic in head7-9, 12, 13 Ordering1-8, 10-17, 19, 20 Joins (including self-joins)12-17, 19 Range functions (e.g. Between, ABS)2, 3, 5, 8, 12, 13, 15, Aggregate functions (including Group By)7-9, 18 Math functions (e.g. power, log, root)4, 9, 16 Trigonometry functions8, 12 Negated sub-query18, 20 Type casting (e.g. radians to degrees)7, 8, 12 Server defined functions10, June 2009A.J.G. Gray - Bolzano-Bozen Seminar45 Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19

Can RDB2RDF tools feasibly expose large science archives for data integration? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar46

Experiment Time query evaluation – 5 out of 20 queries used – No joins Systems compared: – Relational DB (Base line) MySQL v – RDB2RDF tools D2RQ v0.5.2 SquirrelRDF v0.1 – RDF Triple stores Jena v2.5.6 (SDB) Sesame v2.1.3 (Native) 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar47 Relational DB RDB2RDF SPARQL query Triple store SPARQL query Relational DB SQL query

Experimental Configuration 8 identical machines – 64 bit Intel Quad Core Xeon 2.4GHz – 4GB RAM – 100 GB Hard drive – Java 1.6 – Linux 10 repetitions 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar48

Performance Results 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar49 ms 3,450 5,339 21, ,932 2,7337,2294,0901,307 17,793 7,468 19, ,561 1

The Show Stopper: Query Translation Each bound variable resulted in a self-join – RDBMS cannot optimize for this – RDBMS perform badly with self-joins Each row retrieved with a separate query – 1 query becomes n queries, where n is cardinality of relation Predicate selection in RDB2RDF tool – No optimization possible 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar50

Extracting Relevant Data: Conclusions SPARQL not expressive enough for real (astronomy) queries RDBMS benefits from 30+ years research – Query optimisation – Indexes RDF stores are improving – Require existing data to be replicated RDB2RDF tools show promise – Need to exploit relational database 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar51

Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration? Not currently! More work needed on query translation… 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar52

Conclusions & Future Work Traditional Integration Challenges 1.Locating data 2.Extracting relevant data 3.Understanding data Semantic Web Solution SKOS Vocabularies – Removes ambiguity – Enables limited machine understanding RDB2RDF Tools – Requires improved query translation Semantic model mappings – Follow “chains” of mappings – Relies on RDB2RDF work 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar53