JUC2006
Scaling Jena in a commercial environment
The Ingenta MetaStore Project
● Purpose
  ● Give an example of a big, commercial app using Jena.
  ● Share experiences and problems
Presented at: Jena User Conference, May 2006, Bristol, UK.
TODO:LINK TO PAPER...
What is the metastore?
An RDF triple store which:
* Holds Ingenta's bibliographic metadata
* Is a centralised system
* Uses a flexible format
* Is scalable
* Is distributable
* Is easily integrated
Existing Systems
[Diagram; labels recovered from the slide:]
* 4.3 million Article Headers
* 8 million References
* Publishers/Titles Database
* IngentaConnect Website
* Preprints
* 20 million External Holdings
* Publishers
* Other Aggregators
* Article Headers, live system
Architecture of new system
[Diagram; labels recovered from the slide:]
* RDF Triplestore (PostgreSQL): Master, Slave
* XML API (read only) (Jena)
* Primary Loader (Jena)
* JMS Queue
* IngentaConnect
* Other Clients
* JMS Queue
* Other Systems
* Customer Data
* Other loaders / enhancers
RDFS Modelling: What was the data anyway?
Standard Vocabularies
* Dublin Core
* PRISM
* FOAF
Custom Vocabularies
* Identifiers
* Structure
* Branding
Some stats about schemas
* 28 Classes
* 72 Properties
* 4/18th from Standard Vocabs
Journal XML, with highlights
Hosting description XML, with highlights
Example queries by client: IngentaConnect

All properties of a book:
  PREFIX struct:
  PREFIX dc:
  PREFIX rdf:
  SELECT ?titleid ?prop ?val
  WHERE {
    ?titleid rdf:type struct:Book.
    ?titleid dc:identifier.
    ?titleid ?prop ?val
  }

Browse by publisher:
  PREFIX rdf:
  PREFIX dc:
  PREFIX struct:
  SELECT ?pubid ?pubname
  WHERE {
    ?pubid rdf:type struct:Publisher.
    ?pubid dc:title ?pubname
  }
Example query by background process
  PREFIX dcterms:
  PREFIX struct:
  PREFIX dc:
  PREFIX linking:
  PREFIX rdf:
  SELECT ?article ?date ?hostingDesc
  WHERE {
    ?article rdf:type struct:Article.
    ?article struct:status.
    ?article dcterms:created ?date.
    OPTIONAL {
      ?hostingDesc rdf:type linking:hostingDescription.
      ?hostingDesc linking:hostedArticle ?article.
      ?hostingDesc linking:linkingPartner.
    }
    FILTER (! bound(?hostingDesc))
    FILTER ( ?date > \ T14:35: :00\^^ )
  }
  LIMIT 50 ;
OK, enough about your project, tell me something about Jena!
OK....
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?
How did we choose an RDF Engine?
Experimented with Java APIs
* Jena + PostgreSQL
* Sesame + PostgreSQL
* Kowari + native
Method of testing
Why did we choose Jena?
* Relational Database backend
* Usability, Support
* Easy to debug
* Schemagen
* Scalability
What problems did we have?
1. Insertion - performance
2. Ontologies - memory
3. Java classes - limiting flexibility
(most problems due to scale..)
The Project - Scale
* Number of triples = ~200 million and keeps growing
* Size on disk = 65 GB
* Result of loading 4.3 million articles and references
Some details of database tables
* jena_long_lit - ~4.5 million records
* jena_long_uri - ~0.14 million records
Prob 1. Insertion performance
* Task - load backdata
* What does that actually involve? For each article:
  - Get metadata from database 1.
  - Add metadata from database 2.
  - Reform into new RDFS model
  - Query the store - look for relevant resources
  - Model.read
* Problem
* Possible Solutions?
  - Turn off index rebuild
  - Turn off duplicate checking
  - Batching
Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
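The slide does not spell out how batching was implemented. As a minimal sketch of the general idea, assuming it means buffering items in memory and handing them to the store in fixed-size groups (for example one database transaction per group), something like the following could apply. `BatchingLoader` and its `sink` callback are hypothetical names for illustration, not Jena API and not the MetaStore code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batching buffer; illustrative only, not part of Jena. */
final class BatchingLoader<T> {
    private final int batchSize;
    // Assumption: the sink writes one whole batch to the store in a single step.
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    BatchingLoader(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and writes the batch out once it is full. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes any buffered items; call once at the end for the partial batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink may keep it
        buffer.clear();
        flushes++;
    }

    /** Number of batches handed to the sink so far. */
    int flushCount() {
        return flushes;
    }
}
```

The trade-off in a design like this is that larger batches mean fewer round trips to the store, while items sit in memory longer and a failed write affects the whole batch.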
Prob 2. Ontologies - memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?
Prob 3. Java Classes - limiting flexibility?
* Not a problem with Jena/scale, but with industrial context
* Encapsulate Jena code in DAOs
* Java Interface hierarchy to mirror Schema
* What is the problem with that?
* Solutions?
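A rough sketch of the pattern named in the bullets above: a Java interface hierarchy that mirrors schema classes, with store access hidden behind a DAO. Every name here (`MetaResource`, `Title`, `Journal`, `Article`, `ArticleDao`, `InMemoryArticleDao`) is hypothetical, and the in-memory map stands in for the Jena-backed implementation, which the slides do not show.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interfaces mirroring schema classes; illustrative names only.
interface MetaResource { String uri(); }
interface Title extends MetaResource { String title(); }
interface Journal extends Title { }
interface Article extends MetaResource { String startingPage(); }

// DAO boundary: callers see domain interfaces, never the RDF API behind them.
interface ArticleDao {
    Article findByUri(String uri);
}

// In-memory stand-in for a store-backed DAO (assumption, for illustration).
final class InMemoryArticleDao implements ArticleDao {
    private final Map<String, String> startingPages = new HashMap<>();

    void put(String uri, String startingPage) {
        startingPages.put(uri, startingPage);
    }

    @Override
    public Article findByUri(String uri) {
        final String page = startingPages.get(uri);
        if (page == null) {
            return null; // unknown article
        }
        return new Article() {
            @Override public String uri() { return uri; }
            @Override public String startingPage() { return page; }
        };
    }
}
```

The slide leaves its own question unanswered. One commonly cited cost of this pattern, offered here only as a possible reading, is that each new schema property needs a matching change to an interface and its DAO, so the schema can no longer evolve independently of the Java code.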
Performance Testing SPARQL
  SELECT ?title ?issue ?article
  WHERE {
    ?title dc:identifier.
    ?issue prism:isPartOf ?title.
    ?issue prism:volume ?volumeLiteral.
    ?issue prism:number ?issueLiteral.
    ?article prism:isPartOf ?issue.
    ?article prism:startingPage ?firstPageLiteral.
    FILTER ( ?volumeLiteral = "20" )
    FILTER ( ?issueLiteral = "4" )
    FILTER ( ?firstPageLiteral = "539" )
  }
Type patterns:
  ?title rdf:type struct:Journal.
  ?issue rdf:type struct:Issue.
  ?article rdf:type struct:Article.
Query variants:
* NO TYPES QUERY
* TITLE TYPE QUERY
* ALL TYPES QUERY
Performance Testing SPARQL
Title type only - <1.5 secs for 150 million triples
TEST CONDITIONS
* Jena 2.3
* PostgreSQL 7
* Debian
* Intel(R) Xeon(TM) CPU 3.20GHz
* 6 SCSI Drives
* 4 GB RAM
Where are we now with the project?
Recent Work
* Loaded 4.3 million articles through the batching process; ongoing loading in place
* Non-journal content modelled
* REST API
Current Work
* Replication
* Phase out batching and use queues instead
* SPARQL merging with external named graphs
Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore
The End