RDF Triple Stores. Nipun Bhatia, Department of Computer Science, Stanford University
Contents: Introduction; Different Architectures; Implications; An Example: Jena SDB; Evaluations; Evaluations using LUBM/DBPedia; Open Research Issues; Which RDF Store to choose for a particular application?; Possible system diagram for Phenotype Annotations.
Introduction. What is an RDF store? A system that provides a mechanism for persistent storage of, and access to, RDF graphs. Potential application areas: plenty! For example, as a backend for Protege, BioPortal, and Phenotype Annotations.
Different Architectures. Based on their implementation, RDF stores can be divided into 3 broad categories: In-Memory, Native, and Non-Native Non-Memory. In-Memory: the RDF graph is stored as triples in main memory, e.g. storing an RDF graph using the Jena API or the Sesame API. Native: persistent storage systems with their own database implementation, e.g. Sesame Native, Virtuoso, AllegroGraph, Oracle 11g. Non-Native Non-Memory: persistent storage systems set up to run on third-party databases, e.g. Jena SDB.
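To make the in-memory category concrete, here is a minimal sketch in Python of a graph held as a set of triples in main memory, with wildcard pattern matching. It is not the Jena or Sesame API; the class name `InMemoryGraph`, its methods, and the `ex:` identifiers are hypothetical and chosen only for illustration.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class InMemoryGraph:
    """Toy in-memory RDF graph: a set of (subject, predicate, object) tuples.

    Hypothetical class for illustration; not the Jena or Sesame API.
    """

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        # A set silently ignores duplicate triples.
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for t in self._triples:
            if ((s is None or t[0] == s)
                    and (p is None or t[1] == p)
                    and (o is None or t[2] == o)):
                yield t

    def __len__(self) -> int:
        return len(self._triples)


graph = InMemoryGraph()
graph.add("ex:alice", "rdf:type", "ex:Student")
graph.add("ex:alice", "ex:memberOf", "ex:Dept0")
graph.add("ex:bob", "rdf:type", "ex:Student")
graph.add("ex:bob", "rdf:type", "ex:Student")  # duplicate, not stored twice

# All subjects typed as ex:Student, sorted for a stable order.
students = sorted(t[0] for t in graph.match(p="rdf:type", o="ex:Student"))
```

Everything lives in process memory, so the graph is lost when the program exits; that is the gap the native and non-native persistent stores fill.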
Implications. Scalability. Different query languages are supported to varying degrees: Sesame supports SeRQL; Oracle 11g has its own query language. Different levels of inferencing: Sesame supports RDFS inference; AllegroGraph supports RDFS++; Oracle 11g supports RDFS++ and OWL Prime. Lack of interoperability and portability, more pronounced in native stores.
Jena SDB. SDB is basically a Java loader. Multiple stores are supported: MySQL, PostgreSQL, Oracle, DB2. It takes incoming triples and breaks them down into components ready for the database. Multiple layouts. Integration with the Joseki server. SPARQL is supported. (Non) Interest Declaration: I was previously an intern at HP Labs with the Jena team.
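The idea of breaking incoming triples into components for a relational database can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL or PostgreSQL, and the two-table schema (`nodes`, `triples`) and the function `load_triples` are hypothetical, not SDB's actual layout or API.

```python
import sqlite3


def load_triples(conn, triples):
    """Split each triple into node ids and store them in relational tables.

    Assumed schema (not SDB's real one): `nodes` maps each distinct term
    to an integer id, `triples` stores one row of three ids per triple.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes "
        "(id INTEGER PRIMARY KEY, lex TEXT UNIQUE NOT NULL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples "
        "(s INTEGER NOT NULL, p INTEGER NOT NULL, o INTEGER NOT NULL, "
        "PRIMARY KEY (s, p, o))")

    def node_id(lex):
        # Insert the term if it is new, then look up its id.
        conn.execute("INSERT OR IGNORE INTO nodes (lex) VALUES (?)", (lex,))
        row = conn.execute(
            "SELECT id FROM nodes WHERE lex = ?", (lex,)).fetchone()
        return row[0]

    for s, p, o in triples:
        conn.execute(
            "INSERT OR IGNORE INTO triples (s, p, o) VALUES (?, ?, ?)",
            (node_id(s), node_id(p), node_id(o)))
    conn.commit()


conn = sqlite3.connect(":memory:")
load_triples(conn, [
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:bob", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:memberOf", "ex:Dept0"),
])

node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
triple_count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]

# Answering a pattern needs joins back to the node table.
student_rows = conn.execute(
    "SELECT n_s.lex FROM triples t "
    "JOIN nodes n_s ON n_s.id = t.s "
    "JOIN nodes n_p ON n_p.id = t.p "
    "JOIN nodes n_o ON n_o.id = t.o "
    "WHERE n_p.lex = ? AND n_o.lex = ? ORDER BY n_s.lex",
    ("rdf:type", "ex:Student")).fetchall()
```

The joins in the final query hint at why a store layered on a third-party database can be slower than a native one: every triple pattern becomes relational work that the SQL engine plans on its own.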
Evaluations. Third-party evaluations for Sesame, Jena SDB, and Virtuoso; company evaluations for Oracle 11g. Methodology: LUBM (Lehigh University Benchmark); DBPedia; multiple queries; load times.
Evaluations. DBPedia: a database of structured information extracted from Wikipedia, with information about places, persons, music albums and films [2]. LUBM: synthetically generated RDF data containing universities, departments, students, etc. [1]. Dataset sizes: DataSet 1: 15,472,624 triples, 2.1 GB. DataSet 2: LUBM 50 – 2.75 Million & LUBM 1000 – Million. 3 queries.
Loading Time – DataSet 1
Results – Query 1: simple select query with 2 variables.
Query 2: unconstrained select query; only the predicate was specified.
Query 3: complex query; uses a filter.
Oracle 11g – DataSet 2
Ontology (size)            RDFS triples   RDFS time   OWL Prime triples   OWL Prime time
LUBM – 50 (6.8 Million)    2.75 M         12.14 min   3.05 M              8.01 min
LUBM – 1000 (133.6 M)      55.09 M        7h 19m      65.25 M             7h 12m
Observations. Native stores perform better than systems using third-party stores. Optimizations are possible. Each of the systems uses a different database layout: Virtuoso – OGPS, POGS, PSOG, SOPG; SDB – SPO, GSPO. Hashing on SDB is very bad.
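Why the choice of layout matters can be illustrated with a small sketch. It assumes the letters in names such as GSPO and POGS stand for graph, subject, predicate and object, and that each layout is a sorted index whose key columns follow that order. The helper names `build_index` and `prefix_scan` are hypothetical, and the code does not reproduce how Virtuoso or SDB implement their indexes.

```python
from bisect import bisect_left

# Column positions in a quad written as (graph, subject, predicate, object).
POS = {"G": 0, "S": 1, "P": 2, "O": 3}


def build_index(quads, order):
    """Return the quads with columns permuted to `order` (e.g. "POGS"), sorted."""
    cols = [POS[c] for c in order]
    return sorted(tuple(q[c] for c in cols) for q in quads)


def prefix_scan(index, prefix):
    """Return all entries whose leading columns equal `prefix`.

    Binary search finds the first candidate; the scan stops at the first
    entry that no longer shares the prefix.
    """
    prefix = tuple(prefix)
    hits = []
    i = bisect_left(index, prefix)
    while i < len(index) and index[i][:len(prefix)] == prefix:
        hits.append(index[i])
        i += 1
    return hits


quads = [
    ("g1", "ex:alice", "rdf:type", "ex:Student"),
    ("g1", "ex:bob", "rdf:type", "ex:Student"),
    ("g1", "ex:alice", "ex:memberOf", "ex:Dept0"),
    ("g2", "ex:carol", "rdf:type", "ex:Professor"),
]

pogs = build_index(quads, "POGS")
gspo = build_index(quads, "GSPO")

# A predicate-led lookup is a cheap prefix scan on POGS ...
typed_students = prefix_scan(pogs, ("rdf:type", "ex:Student"))
student_subjects = [entry[3] for entry in typed_students]

# ... while GSPO is cheap only when graph (and subject) are known.
alice_in_g1 = prefix_scan(gspo, ("g1", "ex:alice"))
```

Under these assumptions, a query that fixes only the predicate (like Query 2) can use a predicate-leading index directly, whereas a store offering only SPO and GSPO orderings would have to scan far more entries.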
Open Research Issues. Inferencing [4]. Present common implementations make a number of small queries to propagate the effects of rule firing. Each of these queries creates an interaction with the database, which is not very efficient. Approaches: (1) Snapshot the contents of the database-backed model into RAM for the duration of processing by the inference engine. (2) Perform inferencing in-stream: precompute the inference closure of the ontology, analyze the incoming data streams, and add triples to them based on the inference closure. This assumes a rigid separation of the RDF data (A-box) and the ontology data (T-box). Even this may not work for very large ontologies, such as biomedical ontologies.
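The in-stream approach can be sketched for one RDFS rule: propagating `rdf:type` along `rdfs:subClassOf`. The sketch below precomputes the superclass closure of a small T-box once, then enriches each incoming A-box triple without further queries against the store. The function names `subclass_closure` and `infer_in_stream` and the `ex:` terms are hypothetical; this is an illustration of the idea, not the pipeline described in [4].

```python
from collections import defaultdict


def subclass_closure(tbox):
    """Map each class to all of its transitive superclasses."""
    direct = defaultdict(set)
    for s, p, o in tbox:
        if p == "rdfs:subClassOf":
            direct[s].add(o)

    closure = {}
    for cls in list(direct):
        seen, stack = set(), list(direct[cls])
        while stack:
            parent = stack.pop()
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.extend(direct.get(parent, ()))
        closure[cls] = seen
    return closure


def infer_in_stream(abox_stream, closure):
    """Yield each incoming triple plus the type triples it entails."""
    for s, p, o in abox_stream:
        yield (s, p, o)
        if p == "rdf:type":
            for parent in sorted(closure.get(o, ())):
                yield (s, "rdf:type", parent)


# T-box (ontology): precomputed once, kept in memory.
tbox = [
    ("ex:GraduateStudent", "rdfs:subClassOf", "ex:Student"),
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
]
closure = subclass_closure(tbox)

# A-box (instance data): processed triple by triple as it arrives.
abox = [
    ("ex:alice", "rdf:type", "ex:GraduateStudent"),
    ("ex:alice", "ex:memberOf", "ex:Dept0"),
]
enriched = list(infer_in_stream(abox, closure))
```

The sketch relies on the same separation of A-box and T-box that the slide assumes: if incoming data could add new `rdfs:subClassOf` statements, the precomputed closure would be stale. The closure is also held in memory, which is where very large ontologies become a problem.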
Open Research Issues. Query optimization: third-party stores undo any optimization done at the API level; the better performance of native stores points in that direction. There is some work on optimizing SPARQL queries for in-memory stores.
Which RDF store to choose for an app? Frequency of loads that the application would perform. Single scaling factor and linear load times. Level of inferencing. Which query languages are supported. W3C recommendations. Special system needs, e.g. AllegroGraph needs a 64-bit processor.
Phenotype Annotations. [System diagram. Labels: set of ontologies required for Phenotype Annotations (e.g. PATO, Fly, etc.); Phenotype Annotations; Jena API; Jena Model; SDB; MySQL / Virtuoso; Inferencing.]
References. [1] [2] [3] Kurt Rohloff et al.: An Evaluation of Triple-Store Technologies for Large Data Stores. Comparing Sesame, Jena and AllegroGraph. 2007. [4] N. Bhatia, A. Seaborne: 'Ingestion pipeline for RDF'.