Scaling Jena in a commercial environment
The Ingenta MetaStore Project
Purpose
● Give an example of a big, commercial app using Jena.
● Share experiences and problems.

What is the metastore?
An RDF triple store which holds Ingenta's bibliographic metadata. It is:
* A centralised system
* Flexible in format
* Scalable
* Distributable
* Easily integratable

Existing Systems
[Diagram labels: 4.3 million Article Headers; 8 million References; Publishers/Titles Database; IngentaConnect Website; Preprints; 20 million External Holdings; Publishers; Other Aggregators; Article Headers, live system]


Architecture of new system
[Diagram labels: RDF Triplestore (PostgreSQL), Master and Slave; XML API (read only) (Jena); Primary Loader (Jena); JMS Queue; IngentaConnect; Other Clients; JMS Queue; Other Systems; Customer Data; Other loaders / enhancers]

RDFS Modelling – What was the data anyway?
Standard Vocabularies
* Dublin Core
* PRISM
* FOAF
Custom Vocabularies
* Identifiers
* Structure
* Branding
Some stats about schemas
* 28 Classes
* 72 Properties
* 4/18ths from Standard Vocabs

Journal XML, with highlights

Hosting description XML, with highlights

OK, enough about your project, tell me something about Jena!
OK...
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?

How did we choose an RDF Engine?
Experimented with Java APIs
* Jena + PostgreSQL
* Sesame + PostgreSQL
* Kowari + native
Method of testing

Why did we choose Jena?
* Relational database backend
* Usability, support
* Easy to debug
* Schemagen
* Scalability

What problems did we have?
1. Insertion - performance
2. Ontologies – memory
3. Encapsulation – limiting flexibility
(most problems due to scale...)

The Project - Scale
* Number of triples = ~200 million and keeps growing
* Size on disk = 65 GB
* Result of loading 4.3 million articles and references
Some details of database tables
* jena_long_lit – ~4.5 million records
* jena_long_uri – ~0.14 million records

Prob 1. Insertion performance
* Task - load backdata
* What does that actually involve? For each article:
  – Get metadata from database 1.
  – Add metadata from database 2.
  – Reform into new RDFS model
  – Query the store – look for relevant resources
  – Model.read
* Problem
* Possible Solutions?
  - Turn off index rebuild
  - Turn off duplicate checking
  - Batching

Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
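The slide does not say how batching was implemented, so the following is only a minimal sketch of the general idea, assuming that batching here means collecting the data for many articles and writing it to the store in one operation rather than one write per article. `BatchLoaderSketch`, `Batcher`, the batch size of 3 and the `article-N` strings are all hypothetical; no Jena calls are shown, and in a real loader the sink would be whatever performs the write to the triple store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of batching: buffer items, write them out in groups. */
final class BatchLoaderSketch {

    /** Buffers items and passes them to the sink once batchSize items are waiting. */
    static final class Batcher<T> {
        private final int batchSize;
        private final Consumer<List<T>> sink; // stands in for "one write to the store"
        private final List<T> buffer = new ArrayList<>();
        private int flushCount = 0;

        Batcher(int batchSize, Consumer<List<T>> sink) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.batchSize = batchSize;
            this.sink = sink;
        }

        /** Adds one item; triggers a write when the buffer is full. */
        void add(T item) {
            buffer.add(item);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Writes whatever is buffered (if anything) as a single batch. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(buffer)); // copy so the sink owns its list
            buffer.clear();
            flushCount++;
        }

        int flushCount() {
            return flushCount;
        }
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        Batcher<String> batcher = new Batcher<>(3, written::add);
        for (int i = 1; i <= 7; i++) {
            batcher.add("article-" + i);
        }
        batcher.flush(); // write the final, partly filled batch
        System.out.println(written.size() + " writes for 7 articles"); // prints "3 writes for 7 articles"
    }
}
```

One cost that typically comes with this design is that items wait in memory until a batch fills, so the store lags behind the loader by up to one batch and the final `flush()` must not be forgotten.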

Prob 2. Ontologies – memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?

Prob 3. Encapsulation – limiting flexibility
* Not really a problem with Jena – an experience
* Why are we encapsulating the Jena code?
* What is the problem with that?
* Solutions?

Performance Testing SPARQL
Standard query – TITLE TYPE QUERY
  SELECT ?title ?issue ?article
  WHERE {
    ?title rdf:type struct:Journal.
    ?title dc:identifier.
    ?issue prism:isPartOf ?title.
    ?issue prism:volume ?volumeLiteral.
    ?issue prism:number ?issueLiteral.
    ?article prism:isPartOf ?issue.
    ?article prism:startingPage ?firstPageLiteral.
    FILTER ( ?volumeLiteral = "20" )
    FILTER ( ?issueLiteral = "4" )
    FILTER ( ?firstPageLiteral = "539" )
  }

Performance Testing SPARQL
NO TYPES QUERY
  SELECT ?title ?issue ?article
  WHERE {
    ?title dc:identifier.
    ?issue prism:isPartOf ?title.
    ?issue prism:volume ?volumeLiteral.
    ?issue prism:number ?issueLiteral.
    ?article prism:isPartOf ?issue.
    ?article prism:startingPage ?firstPageLiteral.
    FILTER ( ?volumeLiteral = "20" )
    FILTER ( ?issueLiteral = "4" )
    FILTER ( ?firstPageLiteral = "539" )
  }

Performance Testing SPARQL
TITLE TYPE QUERY
  SELECT ?title ?issue ?article
  WHERE {
    ?title dc:identifier.
    ?issue prism:isPartOf ?title.
    ?issue prism:volume ?volumeLiteral.
    ?issue prism:number ?issueLiteral.
    ?article prism:isPartOf ?issue.
    ?article prism:startingPage ?firstPageLiteral.
    FILTER ( ?volumeLiteral = "20" )
    FILTER ( ?issueLiteral = "4" )
    FILTER ( ?firstPageLiteral = "539" )
  }
Type pattern added to the WHERE clause:
  ?title rdf:type struct:Journal.

Performance Testing SPARQL
ALL TYPES QUERY
  SELECT ?title ?issue ?article
  WHERE {
    ?title dc:identifier.
    ?issue prism:isPartOf ?title.
    ?issue prism:volume ?volumeLiteral.
    ?issue prism:number ?issueLiteral.
    ?article prism:isPartOf ?issue.
    ?article prism:startingPage ?firstPageLiteral.
    FILTER ( ?volumeLiteral = "20" )
    FILTER ( ?issueLiteral = "4" )
    FILTER ( ?firstPageLiteral = "539" )
  }
Type patterns added to the WHERE clause:
  ?title rdf:type struct:Journal.
  ?issue rdf:type struct:Issue.
  ?article rdf:type struct:Article.

Performance Testing SPARQL
Title type only - <1.5 secs for 150 million triples
TEST CONDITIONS
* Jena 2.3
* PostgreSQL 7
* Debian
* Intel(R) Xeon(TM) CPU 3.20GHz
* 6 SCSI Drives
* 4G RAM

Where are we now with the project?
Recent Work
* Loaded 4.3 million articles through the batching process; ongoing loading in place
* Non-journal content modelled
* REST API
Current Work
* Replication
* SPARQL merging
* Phase out batching and use queues instead

Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore

The End
