Presentation is loading. Please wait.

Presentation is loading. Please wait.

Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester.

Similar presentations


Presentation on theme: "Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester."— Presentation transcript:

1 Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester

2 Outline Motivation: Astronomy & The Virtual Observatory Data Integration Challenges 1.Locating the relevant data 2.Requesting the required data 3.Understanding the returned data 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar1 Using Semantic Web tools and technology to overcome these challenges

3 Context: Astronomy 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar2 Data collected across electromagnetic spectrum Analysed within one wavelength Data collection is – expensive – time consuming Existing data – large quantities – freely available Image: Wikipedia

4 Virtual Observatory “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.” 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar3

5 Searching for Brown Dwarfs Data sets: – Near Infrared, 2MASS/UK Infrared Deep Sky Survey – Optical, APMCAT/Sloan Digital Sky Survey Complex colour/motion selection criteria Similar problems – Halo White Dwarfs 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar4 Image: AstroGrid

6 Deep Field Surveys Observations in multiple wavelengths – Radio to X-Ray Searching for new objects – Galaxies, stars, etc Requires correlations across many catalogues – ISO – Hubble – SCUBA – etc 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar5

7 Locate, retrieve, and interpret relevant data Heterogeneous publishers – Archive centres – Research labs Heterogeneous data – Relational – XML – Image Files 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar6 Virtual Observatory Virtual Observatory: The Problems

8 Locate, retrieve, and interpret relevant data 1.Which data sources contain relevant data? 2.How do I query the relevant data sources? 3.How can I interpret/combine/analyse the data? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar7 Virtual Observatory

9 Finding relevant data sources 1.Which data sources contain relevant data? 23 June 20098A.J.G. Gray - Bolzano-Bozen Seminar

10 Which data sources do I use? VO registry – 65,000+ entries – Many mirrored services VOExplorer – Registry search tool Resources tagged with keywords 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar9 - 6df - survey - galaxy - galaxies - redshift - redshifts - 2mass - 6df - survey - galaxy - galaxies - redshift - redshifts - 2mass

11 Analysis of Registry Keywords Problems: – Plural/singular – Case – Abbreviations – Different tags – Specificity of tags Thanks to Sébastien Derriere for this data. 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar10 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars...

12 Analysis of Registry Keywords Problems: – Plural/singular – Case Solution: (standard IR techniques) – Stemming Star & Stars become Star – Case normalisation lowercase 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar11 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars...

13 Analysis of Registry Keywords Problems: – Abbreviations – Different tags – Specificity of tags Solution: Need to understand semantics! 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar12 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars...

14 Semantic Options Folksonomies – Keyword tags, freely chosen Vocabulary – Controlled list of words with definitions Taxonomy – Relationships: Broader/Narrower/Related Thesaurus – Synonyms, antonyms, see also Ontology – Formal specification of a shared conceptualisation – OWL “Vocabulary” used to cover vocabularies, taxonomies, and thesauri. 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar13 Image: Leonard Cohen Search Leonard Cohen Search

15 What is a Vocabulary? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar14

16 What is a Controlled Vocabulary? A set of terms with: – Label – Synonyms – Definition – Relationships to other terms: Broader term Narrower term Related term Example: – “Spiral galaxy” – “Spiral nebula” – “A galaxy having a spiral structure” – Relationships carrying semantic information: BT: “Galaxy” NT: “Barred spiral galaxy” RT: “Spiral arm” 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar15

17 Existing Vocabularies in Astronomy Journal Keywords – Developed for tagging papers – 311 terms – Actively used Astronomy Visualization Metadata (AVM) – Tagging images – 217 terms – Actively used IAU Thesaurus – Developed for libraries in 1993 – 2,551 terms – Never really used Unified Content Descriptor (UCD) – Tagging resource data – 473 terms – Actively used 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar16

18 Common Vocabulary Format Requirements: – Provide term identifiers Unambiguous tagging – Capture semantic relationships Poly-hierarchy structure – Machine processable Allows inter-operability “Machine intelligence” – Avoids problems of: Spelling Case Plurality problems Tags – Automated reasoning: Interested in all “Supernova” Items tagged as “1a Supernova” also returned 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar17

19 SKOS – W3C standard for sharing vocabularies – Based on RDF Semantic model for describing resources – Provides URI for each term – Captures properties of terms – Encodes relationships between terms Enables automated reasoning Standard serialisations “Looser” semantics than OWL – Adopted by IVOA as a standard for vocabularies 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar18

20 Example SKOS Vocabulary Term Example “Spiral galaxy” “Spiral nebula” “A galaxy having a spiral structure” Relationships: BT: “Galaxy” NT: “Barred spiral galaxy” RT: “Spiral arm” In turtle notation #spiralGalaxy a concept; prefLabel “Spiral galaxy”@en; altLabel “Spiral nebula”@en; definition “A galaxy having a spiral structure”@en; broader #galaxy; narrower #barredSpiralGalaxy; related #spiralArm. 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar19

21 Inter-operable Vocabularies Which vocabulary should I use? One that you know! Closest match to your needs Vocabulary terms related using mappings – Part of the SKOS standard – One mapping file per pair of vocabularies Inter-vocabulary mappings Broad match: – more general term Narrow match: – more specific term Related match: – associated term Exact match: – equivalent term Close match: – similar but not equivalent term 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar20

22 Mapping Editor 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar21

23 Putting it all together Use vocabulary concepts for – Tagging (using URI) Resources in the registry VOEvent packets – Searching by vocabulary concept User keyword search converted to vocabulary URI Provides semantic advantages – Reasoning about terms Relationships (Intra-vocabulary) Mappings (Inter-vocabulary) Requires a mechanism to convert a string to a concept 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar22

24 Vocabulary Explorer Search and browse vocabularies – Configure Vocabularies Mappings Based on Information Retrieval techniques Matching mechanisms Ranking results 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar23 http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/

25 Search Results RunBB2BM25DFR- BM25 IFB2In- expB2 In- expC2 InL2PL2TF-IDF Initial0.930.95 Query Expansion 0.930.94 0.95 0.940.950.94 Term weighting 1 0.930.95 0.96 Term weighting 2 0.930.95 0.96 0.950.96 Combined0.910.94 0.930.94 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar24 Evaluation over 59 queries nDCG evaluation model (distinguishes highly relevant/relevant/not relevant)

26 Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar25

27 Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar26

28 Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar27

29 Vocabulary Explorer Screenshot 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar28

30 Finding the Right Term: Conclusions Vocabularies improve search – Remove ambiguity – Increase precision and recall – Enable Reasoning about relevance Faceted browsing Provided tools for working with vocabularies – Reliable search from keyword string to vocabulary term – Exploration of vocabularies – Mapping terms across vocabularies 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar29

31 Extracting relevant data 2.How do I query the relevant data sources? 23 June 200930A.J.G. Gray - Bolzano-Bozen Seminar

32 Locate, retrieve, and interpret relevant data Heterogeneous publishers – Archive centres – Research labs Heterogeneous data – Relational – XML – Image Files 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar31 Virtual Observatory Virtual Observatory: The Problems

33 A Data Integration Approach Heterogeneous sources – Autonomous – Local schemas Homogeneous view – Mediated global schema Mapping – LAV: local-as-view – GAV: global-as-view 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar32 Global Schema Query 1 Query n DB 1 Wrapper 1 DB k Wrapper k DB i Wrapper i Mappings Relies on agreement of a common global schema

34 P2P Data Integration Approach Heterogeneous sources – Autonomous – Local schemas Heterogeneous views – Multiple schemas Mappings – From sources to common schema – Between pairs of schema Require common integration data model Can RDF do this? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar33 Schema 1 DB 1 Wrapper 1 DB k Wrapper k DB i Wrapper i Schema j Query 1 Query n Mappings

35 Resource Description Framework W3C standard Designed as a metadata data model Contains semantic details Ideal for linking distributed data Queried through SPARQL 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar34 #foundIn #Sun The Sun #name #MilkyWay Milky Way #name The Galaxy #name IAU:Star rdf:type IAU:BarredSpiral rdf:type

36 SPARQL Declarative query language – Select returned data (projection) Graph or tuples Attributes to return – Describe structure of desired results – Filter data (selection) W3C standard Syntactically similar to SQL – Should be easy for astronomers to learn! 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar35

37 Integrating Using RDF Data resources – Expose schema and data as RDF – Need a SPARQL endpoint Allows multiple – Access models – Storage models Easy to relate data from multiple sources 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar36 Relational DB RDF / Relational Conversion XML DB RDF / XML Conversion Common Model (RDF) Mappings SPARQL query We will focus on exposing relational data sources

38 RDB2RDF: Two Approaches Extract-Transform-Load Data replicated as RDF – Data can become stale Native SPARQL query support – Limited optimisation mechanisms Existing RDF stores Jena Seasame Query-driven Conversion Data stored as relations Native SQL query support – Highly optimised access methods SPARQL queries must be translated Existing translation systems D2RQ SquirrelRDF 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar37

39 System Test Hypothesis Is it viable to perform query- driven conversions to facilitate data access from a data model that an astronomer is familiar with? Can RDB2RDF tools feasibly expose large science archives for data integration? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar38 Relational DB RDB2RDF XML DB RDF / XML Conversion Common Model (RDF) Mappings SPARQL query SPARQL query

40 Astronomical Test Data Set SuperCOSMOS Science Archive (SSA) – Data extracted from scans of Schmidt plates – Stored in a relational database – About 4TB of data, detailing 6.4 billion objects – Fairly typical of astronomical data archives Schema designed using 20 real queries Personal version contains – Data for a specific region of the sky – About 0.1% of the data – About 500MB 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar39 Image: SuperCOSMOS Science Archive

41 Analysis of Test Data Using personal version – About 500MB in size (similar size to related work) Organised in 14 Relations – Number of attributes: 2 – 152 4 relations with more than 20 attributes – Number of rows: 3 – 585,560 – Two views Complex selection criteria in views 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar40 Makes this different from business cases and previous work!

42 Is SPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar41

43 Real Science Queries Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5 SQL: SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI FROM ReliableStars WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64) SPARQL: SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI WHERE { … FILTER (?sCorMagB – ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80) FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)} 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar42

44 Analysis of Test Queries Query FeatureQuery Numbers Arithmetic in body1-5, 7, 9, 12, 13, 15-20 Arithmetic in head7-9, 12, 13 Ordering1-8, 10-17, 19, 20 Joins (including self-joins)12-17, 19 Range functions (e.g. Between, ABS)2, 3, 5, 8, 12, 13, 15, 17-20 Aggregate functions (including Group By)7-9, 18 Math functions (e.g. power, log, root)4, 9, 16 Trigonometry functions8, 12 Negated sub-query18, 20 Type casting (e.g. Radians to degrees)7, 8, 12 Server defined functions10, 11 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar43

45 Expressivity of SPARQL Features Select-project-join Arithmetic in body Conjunction and disjunction Ordering String matching External function calls (extension mechanism) Limitations Range shorthands Arithmetic in head Math functions Trigonometry functions Sub queries Aggregate functions Casting 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar44

46 Analysis of Test Queries Query FeatureQuery Numbers Arithmetic in body1-5, 7, 9, 12, 13, 15-20 Arithmetic in head7-9, 12, 13 Ordering1-8, 10-17, 19, 20 Joins (including self-joins)12-17, 19 Range functions (e.g. Between, ABS)2, 3, 5, 8, 12, 13, 15, 17-20 Aggregate functions (including Group By)7-9, 18 Math functions (e.g. power, log, root)4, 9, 16 Trigonometry functions8, 12 Negated sub-query18, 20 Type casting (e.g. radians to degrees)7, 8, 12 Server defined functions10, 11 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar45 Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19

47 Can RDB2RDF tools feasibly expose large science archives for data integration? 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar46

48 Experiment Time query evaluation – 5 out of 20 queries used – No joins Systems compared: – Relational DB (Base line) MySQL v5.1.25 – RDB2RDF tools D2RQ v0.5.2 SquirrelRDF v0.1 – RDF Triple stores Jena v2.5.6 (SDB) Sesame v2.1.3 (Native) 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar47 Relational DB RDB2RDF SPARQL query Triple store SPARQL query Relational DB SQL query

49 Experimental Configuration 8 identical machines – 64 bit Intel Quad Core Xeon 2.4GHz – 4GB RAM – 100 GB Hard drive – Java 1.6 – Linux 10 repetitions 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar48

50 Performance Results 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar49 ms 3,450 5,339 21,492 485,932 2,7337,2294,0901,307 17,793 7,468 19,984 372,561 1

51 The Show Stopper: Query Translation Each bound variable resulted in a self-join – RDBMS cannot optimize for this – RDBMS perform badly with self-joins Each row retrieved with a separate query – 1 query becomes n queries, where n is cardinality of relation Predicate selection in RDB2RDF tool – No optimization possible 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar50

52 Extracting Relevant Data: Conclusions SPARQL not expressive enough for real (astronomy) queries RDBMS benefits from 30+ years research – Query optimisation – Indexes RDF stores are improving – Require existing data to be replicated RDB2RDF tools show promise – Need to exploit relational database 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar51

53 Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration? Not currently! More work needed on query translation… 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar52

54 Conclusions & Future Work Traditional Integration Challenges 1.Locating data 2.Extracting relevant data 3.Understanding data Semantic Web Solution SKOS Vocabularies – Removes ambiguity – Enables limited machine understanding RDB2RDF Tools – Requires improved query translation Semantic model mappings – Follow “chains” of mappings – Relies on RDB2RDF work 23 June 2009A.J.G. Gray - Bolzano-Bozen Seminar53


Download ppt "Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester."

Similar presentations


Ads by Google