Scaling Jena in a commercial environment
The Ingenta MetaStore Project

Purpose
● Give an example of a big, commercial app using Jena.
● Share experiences and problems
What is the metastore?
An RDF triple store which holds Ingenta's bibliographic metadata and is:
● A centralised system
● Flexible in format
● Scalable
● Distributable
● Easily integrated
Existing Systems
4.3 million Article Headers
8 million References
Publishers/Titles Database
IngentaConnect Website
Preprints
20 million External Holdings
Publishers
Other Aggregators
Article Headers, live system
Architecture of new system
RDF Triplestore (PostgreSQL): Master, Slave
XML API (read only) (Jena)
Primary Loader (Jena)
JMS Queue
IngentaConnect
Other Clients
JMS Queue
Other Systems
Customer Data
Other loaders / enhancers
RDFS Modelling – What was the data anyway?
Standard Vocabularies
● Dublin Core
● PRISM
● FOAF
Custom Vocabularies
● Identifiers
● Structure
● Branding
Some stats about schemas
● 28 Classes
● 72 Properties
● 4/18th from Standard Vocabs
Journal XML, with highlights
Hosting description XML, with highlights
OK, enough about your project, tell me something about Jena!
OK....
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?
How did we choose an RDF Engine?
Experimented with Java APIs
● Jena + PostgreSQL
● Sesame + PostgreSQL
● Kowari + native
Method of testing
Why did we choose Jena?
● Relational Database backend
● Usability, Support
● Easy to debug
● Schemagen
● Scalability
What problems did we have?
1. Insertion – performance
2. Ontologies – memory
3. Encapsulation – limiting flexibility
(most problems due to scale..)
The Project - Scale
Number of triples = ~200 million and keeps growing
Size on disk = 65 GB
Result of loading 4.3 million articles and references
Some details of database tables
jena_long_lit - ~4.5 million records
jena_long_uri - ~0.14 million records
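As a back-of-envelope reading of the figures above, the average disk footprint per triple can be estimated. This is only a rough sketch: it assumes the on-disk size quoted above is 65 decimal gigabytes and that the triple count is exactly 200 million, and the result includes indexes and any other database overhead.

```python
# Figures taken from the slide; both are approximate.
triples = 200_000_000          # "~200 million" triples
disk_bytes = 65 * 10**9        # assumption: 65 decimal gigabytes on disk

# Average on-disk footprint per triple, indexes and overhead included.
bytes_per_triple = disk_bytes / triples
print(f"{bytes_per_triple:.0f} bytes per triple")
```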
Prob 1. Insertion performance
* Task - load backdata
* What does that actually involve? For each article:
  – Get metadata from database 1
  – Add metadata from database 2
  – Reform into new RDFS model
  – Query the store – look for relevant resources
  – Model.read
* Problem
* Possible Solutions?
  - Turn off index rebuild
  - Turn off duplicate checking
  - Batching
Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
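The slides leave the details of batching to the talk. As a rough illustration only, and assuming that batching here means grouping many statements into a single database transaction rather than committing each one separately, the idea can be sketched with Python's built-in sqlite3 module. The table layout, function names and sample triples are all hypothetical; this is neither Jena's API nor the MetaStore loader.

```python
import sqlite3

def make_store():
    """Create an in-memory table standing in for a triple store (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
    return conn

def load_one_by_one(conn, triples):
    """Insert and commit each triple separately: one transaction per triple."""
    commits = 0
    for triple in triples:
        conn.execute("INSERT INTO triples VALUES (?, ?, ?)", triple)
        conn.commit()
        commits += 1
    return commits

def load_batched(conn, triples, batch_size=1000):
    """Insert triples in groups and commit once per group."""
    commits = 0
    for start in range(0, len(triples), batch_size):
        batch = triples[start:start + batch_size]
        conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
        conn.commit()
        commits += 1
    return commits

def count_triples(conn):
    """Return the number of rows currently in the store."""
    return conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]

# Made-up sample data: 2,500 article-title triples.
sample = [(f"article:{i}", "dc:title", f"Title {i}") for i in range(2500)]

unbatched_store = make_store()
batched_store = make_store()
unbatched_commits = load_one_by_one(unbatched_store, sample)
batched_commits = load_batched(batched_store, sample, batch_size=1000)

print(count_triples(unbatched_store), unbatched_commits)
print(count_triples(batched_store), batched_commits)
```

Both stores end up with the same rows, but the batched loader commits far less often. The usual cost is that a failure loses or must retry a whole batch rather than one statement.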
Prob 2. Ontologies – memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?
Prob 3. Encapsulation – limiting flexibility
* Not really a problem with Jena – an experience
* Why are we encapsulating the Jena code?
* What is the problem with that?
* Solutions?
Performance Testing SPARQL
Standard query – TITLE TYPE QUERY

SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal.
  ?title dc:identifier.
  ?issue prism:isPartOf ?title.
  ?issue prism:volume ?volumeLiteral.
  ?issue prism:number ?issueLiteral.
  ?article prism:isPartOf ?issue.
  ?article prism:startingPage ?firstPageLiteral.
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
Performance Testing SPARQL
NO TYPES QUERY

SELECT ?title ?issue ?article
WHERE {
  ?title dc:identifier.
  ?issue prism:isPartOf ?title.
  ?issue prism:volume ?volumeLiteral.
  ?issue prism:number ?issueLiteral.
  ?article prism:isPartOf ?issue.
  ?article prism:startingPage ?firstPageLiteral.
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
Performance Testing SPARQL
TITLE TYPE QUERY

SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal.
  ?title dc:identifier.
  ?issue prism:isPartOf ?title.
  ?issue prism:volume ?volumeLiteral.
  ?issue prism:number ?issueLiteral.
  ?article prism:isPartOf ?issue.
  ?article prism:startingPage ?firstPageLiteral.
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
Performance Testing SPARQL
ALL TYPES QUERY

SELECT ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Journal.
  ?issue rdf:type struct:Issue.
  ?article rdf:type struct:Article.
  ?title dc:identifier.
  ?issue prism:isPartOf ?title.
  ?issue prism:volume ?volumeLiteral.
  ?issue prism:number ?issueLiteral.
  ?article prism:isPartOf ?issue.
  ?article prism:startingPage ?firstPageLiteral.
  FILTER ( ?volumeLiteral = "20" )
  FILTER ( ?issueLiteral = "4" )
  FILTER ( ?firstPageLiteral = "539" )
}
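The three test queries differ only in which rdf:type patterns they include, so they can be generated from one shared body. The sketch below is a hypothetical helper for building the variants as strings, not the test harness used in the project. The slides do not show the object of dc:identifier, so it is a parameter here, and the value passed in the example is made up. PREFIX declarations are omitted, as they are in the slides.

```python
# Graph patterns shared by every variant, as shown in the slides.
# {identifier} stands for the dc:identifier object, which the slides do not show.
BASE_PATTERNS = [
    "?title dc:identifier {identifier} .",
    "?issue prism:isPartOf ?title .",
    "?issue prism:volume ?volumeLiteral .",
    "?issue prism:number ?issueLiteral .",
    "?article prism:isPartOf ?issue .",
    "?article prism:startingPage ?firstPageLiteral .",
]

# Optional rdf:type patterns, keyed by the variable they constrain.
TYPE_PATTERNS = {
    "title": "?title rdf:type struct:Journal .",
    "issue": "?issue rdf:type struct:Issue .",
    "article": "?article rdf:type struct:Article .",
}

FILTERS = [
    'FILTER ( ?volumeLiteral = "20" )',
    'FILTER ( ?issueLiteral = "4" )',
    'FILTER ( ?firstPageLiteral = "539" )',
]

def build_query(identifier, typed=()):
    """Build one query variant; `typed` names the variables that get an rdf:type pattern."""
    lines = [TYPE_PATTERNS[name] for name in typed]
    lines += [pattern.format(identifier=identifier) for pattern in BASE_PATTERNS]
    lines += FILTERS
    body = "\n".join("  " + line for line in lines)
    return "SELECT ?title ?issue ?article\nWHERE {\n" + body + "\n}"

example_id = '"EXAMPLE-ID"'  # hypothetical value, not from the slides
no_types = build_query(example_id)
title_type = build_query(example_id, typed=("title",))
all_types = build_query(example_id, typed=("title", "issue", "article"))

print(title_type)
```

Generating the variants from one body keeps the comparison fair: the only difference between the timed queries is the set of type constraints.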
Performance Testing SPARQL
Title type only - <1.5 secs for 150 million triples

TEST CONDITIONS
Jena 2.3
PostgreSQL 7
Debian
Intel(R) Xeon(TM) CPU 3.20GHz
6 SCSI Drives
4 GB RAM
Where are we now with the project?
Recent Work
* Loaded 4.3 million articles through the batching process; ongoing loading is in place
* Non-journal content modelled
* REST API
Current Work
* Replication
* SPARQL merging
* Phase out batching and use queues instead
Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore
The End