Towards linked sensor data: Analysis of project task, tools and Hackystat architecture
Author: Myriam Leggieri
GSoC 2009 project for Hackystat
Overview
  Hackystat Architecture
    Data Flow
    Modifications to the Data flow
    Modifications to the Subsystem
  Project Task
    Why RDF
    How RDF solves problems
    RDF Model (-> to be added tomorrow after a final revision)
      Metadata of Sensor Data
      Sensor Data Type
  RDF Storage
    Requirements
    Performance
    Relational DB support
    OpenLink Virtuoso
  Semantic Web Framework for Java
    Jena
    Sesame
  Conclusions
Hackystat Architecture: Data Flow
[Diagram] Client-side: Sensor_1, Sensor_2, ..., Sensor_n send Text/XML data to the SensorBase REST API (PUT). Server-side: SensorBase (REST API), DailyProjectData (REST API) and Telemetry; Text/XML data is retrieved with GET requests, and each exchange involves unmarshalling and marshalling of the Text/XML data.
Project Task: Modifications to the Data flow
Add an RDF representation of Sensor Data.
[Diagram, server-side] SensorBase (REST API), DailyProjectData (REST API), Telemetry, plus:
  DataBase: contains both simple data and RDF triples
  RDF_Manager: received requests for RDF are forwarded to it
  Necessary to enhance performance
Project Task: Modifications to the Subsystem
Hackystat is extensible and configurable along 3 dimensions:
1. The set of Sensors
2. The set of Sensor Data Types
3. The set of Applications
Modules are organized into 4 Subsystems:
  CORE: basic framework mechanisms
  APP: applications that generate useful analyses over the Sensor Data
  SENSOR: implements Sensors for development tools
  SDT: implements Sensor Data Types
[Diagram labels] Server-side, Server/Client-side, Client-side
TASK: Add basic framework mechanisms to handle RDF representations (Server-side)
Project Task: Why RDF
Sensor data was originally built for human consumption: it is machine-readable but not machine-understandable.
PROBLEMS:
1. It is hard to automate the manipulation of sensor data
2. It is especially hard to automate its aggregation -> sparseness and redundancy of data
SOLUTION: Metadata to describe the available sensor data
The Resource Description Framework (RDF) is the W3C Recommendation for describing resources, and it is domain-independent.
Project Task: How RDF solves problem 1
Framework = basic conceptual structure used to solve or address complex issues
1. Resource (conceptual mapping of entities)
2. Property (a particular feature characterizing a resource)
3. Statement (a triple of the form (subject, predicate, object))
[Diagram] Resource -> Property -> Resource or Literal
But what is the meaning of 'Creator'?
RDF Schema = a collection of classes, organized in a hierarchy, that defines the terms used in the model and the restrictions on their usage. A sort of vocabulary -> machine-understandable
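The role of the schema in answering "what does 'Creator' mean?" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `http://example.org/hackystat#` namespace, the `creator` property and the `SensorData` and `Developer` classes are hypothetical names, not part of Hackystat. Only the two RDFS domain and range rules are applied.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF statement: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"
EX = "http://example.org/hackystat#"  # hypothetical namespace, for illustration only

# Data: a single statement using the 'creator' property.
data = {Triple(EX + "file42", EX + "creator", EX + "alice")}

# Schema: the vocabulary that gives 'creator' a machine-usable meaning.
schema = {
    Triple(EX + "creator", RDFS_DOMAIN, EX + "SensorData"),
    Triple(EX + "creator", RDFS_RANGE, EX + "Developer"),
}


def infer_types(data: Set[Triple], schema: Set[Triple]) -> Set[Triple]:
    """Derive rdf:type statements from rdfs:domain and rdfs:range declarations."""
    domains = {t.subject: t.obj for t in schema if t.predicate == RDFS_DOMAIN}
    ranges = {t.subject: t.obj for t in schema if t.predicate == RDFS_RANGE}
    inferred = set()
    for t in data:
        if t.predicate in domains:
            # The subject of the property belongs to the declared domain class.
            inferred.add(Triple(t.subject, RDF_TYPE, domains[t.predicate]))
        if t.predicate in ranges:
            # The object of the property belongs to the declared range class
            # (this sketch assumes the object is a resource, not a literal).
            inferred.add(Triple(t.obj, RDF_TYPE, ranges[t.predicate]))
    return inferred
```

Without the schema, the data triple is just three strings; with it, a program can conclude that `file42` is sensor data and `alice` is a developer.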
Project Task: How RDF solves problem 2
Different meanings for the same resource == different namespaces associated with that resource
Different namespaces can be combined == different ways of classifying the world can be combined
+ Different schemas can be linked through suitable properties (e.g. rdfs:subClassOf, rdfs:subPropertyOf, rdfs:seeAlso) and are easily mergeable
-> Easy aggregation of sparse data and integration of redundant data
Example: RSS 1.0 describes web resources using title, description and link. It is extended here by adding modules under different namespaces -> further information is added:
<rdf:RDF xmlns:rdf=" xmlns:dc=" xmlns=" >
  XML: A Disruptive Technology
  XML is placing increasingly heavy loads on the existing technical infrastructure of the Internet.
  The O'Reilly Network
  Simon St.Laurent
  Copyright © 2000 O'Reilly & Associates, Inc.
  XML
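A minimal sketch of why graphs under different namespaces merge easily: with triples held as plain tuples, merging is a set union (duplicates vanish), and an rdfs:subPropertyOf link lets statements written in one vocabulary be read in another. The `http://example.org/hackystat#` namespace and the `owner` property are assumptions made for this example.

```python
DC = "http://purl.org/dc/elements/1.1/"
RDFS_SUBPROPERTY = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
HS = "http://example.org/hackystat#"  # hypothetical namespace, for illustration only


def merge(*graphs):
    """Merge graphs by set union; identical triples are stored only once."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def apply_subproperties(graph):
    """Add (s, super, o) for every (s, sub, o) where sub rdfs:subPropertyOf super.

    Repeats until nothing new is derived, so chains of sub-properties are followed.
    """
    result = set(graph)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for (s, p, o) in result if p == RDFS_SUBPROPERTY]
        for sub, sup in links:
            for (s, p, o) in list(result):
                if p == sub and (s, sup, o) not in result:
                    result.add((s, sup, o))
                    changed = True
    return result


# Two sources describe the same resource with different vocabularies.
sensor_graph = {(HS + "data1", HS + "owner", "Simon St.Laurent")}
feed_graph = {
    (HS + "data1", DC + "subject", "XML"),
    (HS + "data1", HS + "owner", "Simon St.Laurent"),  # redundant statement
}
# A linking schema relates the two vocabularies.
link_schema = {(HS + "owner", RDFS_SUBPROPERTY, DC + "creator")}

combined = apply_subproperties(merge(sensor_graph, feed_graph, link_schema))
```

After the merge, `combined` holds the redundant statement only once and also contains the derived `dc:creator` statement, so a consumer that only understands Dublin Core can still use the sensor graph.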
RDF Storage: Requirements
Hackystat's embedded DB is Apache Derby. Hackystat has also been ported to:
  PostgreSQL server
  Microsoft SQL Server
The RDF store SHOULD:
  Use the W3C-recommended SPARQL as query language
  Support large datasets
  Provide means to implement owl:sameAs inference
  Support at least the same relational DBs supported by Hackystat
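The owl:sameAs requirement can be illustrated with a small union-find sketch: resources declared identical are grouped, and every statement is rewritten to use one representative per group. This is a simplified illustration, not how any of the stores discussed here implements the inference; the URIs in the usage note are hypothetical.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_closure(triples):
    """Rewrite triples so that owl:sameAs-equivalent resources share one name.

    The representative of each equivalence class is its lexicographically
    smallest identifier. The owl:sameAs triples themselves are dropped.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller identifier as root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            union(s, o)

    def canonical(x):
        # Terms never mentioned in an owl:sameAs statement are left untouched.
        return find(x) if x in parent else x

    return {(canonical(s), p, canonical(o)) for s, p, o in triples if p != OWL_SAME_AS}
```

For example, given `a sameAs b`, `b sameAs c`, `(a, tool, Eclipse)` and `(c, owner, alice)`, the function returns both facts attached to `a`, which is what makes aggregation over redundant identifiers possible.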
RDF Storage: Performance
RDF stores: Virtuoso, Sesame, Jena TDB, Jena SDB
Relational DB-to-RDF wrappers: D2R Server, Virtuoso RDF Views
  A wrapper rewrites SPARQL queries into SQL queries against an application-specific relational schema, based on a mapping.
All support SPARQL. How do they perform?
The Berlin SPARQL Benchmark (BSBM) compares the performance of storage systems that expose SPARQL endpoints.
Performance increase of the SUTs (Systems Under Test) between the second query mix and the average query mix in steady state: [figure]
Load times: [figure]
RDF Storage: Relational DB support
(V = supported, X = not supported)

Relational DB / RDF DB   Sesame   Jena   Virtuoso   D2R Server
HSQLDB                   X        V      V          X
MySQL                    V        V      V          V
PostgreSQL               V        V      V          V
Oracle                   V        V      V          V
MS SQL Server            X        V      V          V
Apache Derby             X        V      V          X

Note: [...] has two unsolved issues (though the critical one can be worked around)
RDF Storage: OpenLink Virtuoso
Virtuoso has a general-purpose relational database engine enhanced with RDF-oriented data types (e.g. IRIs and language-tagged and type-tagged strings).
RDF data may be stored as RDF quads (i.e. graph, subject, predicate, object tuples).
RDF data may also be generated on demand by SPARQL queries against a virtual graph mapped from relational data, which may reside in Virtuoso tables or in tables managed by any third-party RDBMS.
  -> Heterogeneous RDBMSs are presented as a single, consistent, SQL-queryable data universe.
Virtuoso RDF Views allow mapping arbitrary collections of relational tables, views, procedures or web services into SPARQL-accessible RDF. The RDF data is constructed on demand by evaluating SQL queries and stored procedures generated on the fly as part of a SPARQL query-processing pipeline.
A Virtuoso Jena Provider (Native Graph Model Storage Provider for the Jena Framework) and a Virtuoso Sesame Provider (Native Graph Model Storage Provider for the Sesame Framework) exist.
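The idea of constructing RDF on demand from relational data can be sketched with Python's built-in `sqlite3` module. This is not Virtuoso's API or mapping language: the `sensor_data` table, its columns and the `http://example.org/` URIs are hypothetical, and the "query" is reduced to an optional predicate filter. No triple is stored anywhere; each one is produced from a SQL query when requested.

```python
import sqlite3

BASE = "http://example.org/sensordata/"  # hypothetical base URI for row subjects
VOCAB = "http://example.org/vocab#"      # hypothetical vocabulary namespace

# Mapping from relational columns to RDF predicates (the "view" definition).
COLUMN_TO_PREDICATE = {
    "tool": VOCAB + "tool",
    "owner": VOCAB + "owner",
}


def triples_on_demand(conn, predicate=None):
    """Yield (subject, predicate, object) triples built from SQL rows.

    If a predicate is given, only the column mapped to it is queried.
    """
    for column, pred in COLUMN_TO_PREDICATE.items():
        if predicate is not None and pred != predicate:
            continue
        # Column names come from the fixed mapping above, never from user input.
        query = f"SELECT id, {column} FROM sensor_data ORDER BY id"
        for row_id, value in conn.execute(query):
            yield (f"{BASE}{row_id}", pred, value)


def demo():
    """Build a small in-memory table and return the 'tool' triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sensor_data (id INTEGER PRIMARY KEY, tool TEXT, owner TEXT)"
    )
    conn.executemany(
        "INSERT INTO sensor_data VALUES (?, ?, ?)",
        [(1, "Eclipse", "alice"), (2, "Ant", "bob")],
    )
    return list(triples_on_demand(conn, VOCAB + "tool"))
```

The design point is that the relational tables remain the single source of truth, so existing Hackystat data would not need to be duplicated to be exposed as RDF.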
Semantic Web framework for Java: Jena
Applications interact with an abstract model.
Semantic Web framework for Java: Sesame
[Architecture diagram] Components:
  Defines interfaces and implementations for all basic RDF entities
  RDF parsers/writers from/to statement/file
  Developer-oriented methods for uploading data files, querying, and extracting and manipulating data (implementations are e.g. SailRepository and HttpRepository)
  For a client/server implementation: Java Servlets that implement a protocol for accessing Sesame repositories over HTTP (there are client libraries to use this protocol, e.g. HttpClient, used by HttpRepository)
  For a local implementation: a layer that abstracts from the storage and inference details, allowing various types of storage and inference to be used (implementations are e.g. MemoryStore, NativeStore, JDBCStore)
    Store types: JDBC, Memory, Native
    Native: stores data directly to disk (instead of in main memory) in a binary format optimized for compact storage and fast retrieval
3 types of queries, depending on the returned type: tuples, graphs, boolean
2 query languages supported: SeRQL, SPARQL (a W3C recommendation)
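The three query types (tuples, graphs, boolean) can be illustrated over an in-memory set of triples. This is a sketch of the concept only, not Sesame's Repository API: the function names are made up for this example, a query is limited to a single triple pattern, and terms starting with `?` are treated as variables.

```python
def match(graph, pattern):
    """Return one variable binding (dict) per triple matching the pattern."""
    results = []
    for triple in sorted(graph):  # sorted for a deterministic result order
        binding = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.get(term, value) != value:
                    ok = False
                    break
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            results.append(binding)
    return results


def select(graph, pattern, variables):
    """Tuple query: return one tuple of bound values per match."""
    return [tuple(b[v] for v in variables) for b in match(graph, pattern)]


def construct(graph, pattern, template):
    """Graph query: return a new graph built from the template for each match."""
    return {tuple(b.get(t, t) for t in template) for b in match(graph, pattern)}


def ask(graph, pattern):
    """Boolean query: report whether the pattern matches at all."""
    return bool(match(graph, pattern))


graph = {
    ("d1", "tool", "Eclipse"),
    ("d2", "tool", "Ant"),
}
```

Here `select(graph, ("?s", "tool", "?t"), ["?s"])` yields tuples, `construct(graph, ("?s", "tool", "?t"), ("?t", "usedIn", "?s"))` yields a graph, and `ask(graph, ("?s", "tool", "Maven"))` yields a boolean.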
Conclusions
The possibilities to choose between are:
1. Using the Jena API
2. Using OpenLink Virtuoso + Jena OR OpenLink Virtuoso + Sesame, with OpenLink Virtuoso acting:
   as a Relational-to-RDF wrapper, or
   as a relational database engine,
   either only for RDF storage, or for RDF, XML and any kind of storage (replacing any other existing relational DB)
Conclusions
Jena
  PROS with respect to Sesame:
  1. Supports Derby and the most common relational DBs
  2. Simplicity
  As RDF storage system:
  1. Quite good performance in the benchmark (using Jena TDB)
  CONS:
  1. Doesn't provide a REST API
  As RDF storage system:
  1. Poor support for large datasets
Sesame
  PROS with respect to Jena:
  1. Availability as a web application through a REST API
  2. More complete set of functionality, especially the web-oriented features
  As RDF storage system:
  1. Better support for large datasets
  CONS:
  As RDF storage system:
  1. Poor performance in the benchmark
OpenLink Virtuoso
  PROS:
  1. Supports any relational DB
  2. Incomparably better performance on benchmarks
  3. Heterogeneous RDBMSs can be viewed as a single, consistent, SQL-queryable data universe
Which is the most suitable combination of tools?