Building and Managing a Massive Triplestore: an experience report
Priya Parvatikar & Katie Portwin, Ingenta
Big Fat TripleStore
Purpose of this talk
1. Our project – why did we get into RDF anyway?
2. Scale – woah, it's a bit big
3. Compromises due to scale:
   - Modelling
   - Loading
   - Querying (incl. performance testing)
Assumes: knowledge of RDF
Our Project
Why did we sign up for all this work?
· What is Ingenta?
· What is the metadata?
· The database seems to be working, why are we replacing it?

- Obligatory background – how did we get into the situation of building an RDF store?
- First, what is the metadata? Ingenta hosts scholarly journal articles, e.g. the journal BMJ, and tech reports from the OECD.
- The store holds metadata about journal articles.
- Researchers access it via the IngentaConnect website – 2 million sessions a month -> quick demo.
Why did we sign up for all this work?
· What is Ingenta?
· What is the metadata?
· The database seems to be working, why are we replacing it?

- The database is more or less working – so what is the problem?
- 1. Multiple databases – data duplicated in various forms, synching needed, everything pulled together to generate a page -> we want to consolidate into a central store.
- 2. The journal model... we actually now have more types of content – books, supplementary data, virtual journals – and we want confidence in the future.
- 3. Publishers want new wacky linking.

1. Multiple systems – synching problem.
2. Things aren't modelled as we'd like them to be.
3. Publishers want new wacky linking.
1. Multiple systems – synching problem

[Diagram: a number of data silos feeding the live system – Publishers, 4.3 million Article Headers, Preprints, 8 million References, Other Aggregators, 20 million External Holdings, the Publishers/Titles Database, and the IngentaConnect Website.]

- As you can see from this diagram, there are a number of data silos, holding subsets of data, duplicated data...
- Some are filestores, others are RDBMSes. For example, the article headers are stored on the filesystem in SGML format, references are stored in a MySQL database, and holdings data in yet another database.
- There are also a variety of applications serving a number of clients and querying a variety of databases, the primary client being the IngentaConnect website.
- Maintenance problems, synchronisation problems.
- So MetaStore has been designed to be a CENTRALISED data store to ease such problems.
2. Things aren't modelled as we'd like them to be.

[Diagram: Article Headers in the live system alongside the 8 million References, shown as duplicated entries (A, A, X, B, Y, Z, Z, Z).]
Why did we choose RDF?
· Merge existing stores – flexibility
· Accommodate future wacky data
· Add more relationships – cross-linking
· Take the pain out of distribution – standard vocabularies

- We decided to go with RDF as the storage technology because:
* We need to merge data from various legacy databases, so we need a flexible model – and RDF is inherently flexible.
* We need the store to be future-proof, able to cater to various new requirements easily. Again, the flexibility of RDF helps.
* We want the ability to add new relationships between data – adding relationships without changing entities – to produce a richer dataset.
* We have distribution agreements with third parties, which traditionally involve a number of munging scripts. Tired of that!! Adopting RDF leads to using industry-standard vocabularies, easing the distribution process.
Why did we choose Jena?
· Java
· Scalability, scalability, scalability
· RDBMS backend

- We are using Jena to persist our triplestore.
- Java: in-house expertise and commercial considerations – rules out Redland etc.
- Scalability: initial testing with 50m triples – Kowari had memory problems, Sesame had performance problems – and we had "confidence" that Jena would scale.
- RDBMS backend: maintenance – backup, replication, handover to sysadmins.
- Nosey: not really our business... but a familiar interface is handy during debugging.
- In the PostgreSQL client, a simple query selecting the first 3 triples shows subject, property, object columns – URIs and literals, stored unencrypted.
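For orientation, here is a minimal sketch of what opening a Jena model over a PostgreSQL backend looks like with the Jena 2.x RDB API (com.hp.hpl.jena.db) – the database that psql then lets us poke around in. The JDBC URL, credentials and model name below are illustrative, not our real configuration.

```java
import com.hp.hpl.jena.db.DBConnection;
import com.hp.hpl.jena.db.IDBConnection;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.ModelMaker;

public class MetaStoreConnection {

    /** Open (or create) a persistent Jena model backed by PostgreSQL. */
    public static Model open() throws Exception {
        // Load the PostgreSQL JDBC driver.
        Class.forName("org.postgresql.Driver");

        // Connection details here are placeholders, not our actual setup.
        IDBConnection conn = new DBConnection(
                "jdbc:postgresql://localhost/metastore", "user", "password", "PostgreSQL");

        // Jena lays out its own statement tables inside the database;
        // those tables are what we inspect directly with the psql client.
        ModelMaker maker = ModelFactory.createModelRDBMaker(conn);
        return maker.openModel("metastore", false);
    }
}
```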
Scale
What is the scale of the project?
· 4.3 million articles and references
· 200 million triples
· 65Gb on disk – 100Gb?
Experiences - Modelling
Modelling... the initial plan
· What do we mean, 'modelling'?
· Standard vocabularies: PRISM, DC, FOAF
· Custom vocabularies
· An example ->

- Modelling: developing schemas and vocabularies to represent the various entities and the relationships between them.
- We tried to use standard vocabularies as far as possible, e.g. DC, PRISM, FOAF.
- We had to create custom vocabularies where the standard ones did not appear to suffice.
- An example ->
[Diagram: an example article graph – /articles/1 and its related resources (/pubs/200, /pubs/10, /titles/100, /parts/1, and /authors/150, a struct:Author / foaf:Person), with literal properties: title "Intestinal helminths and malnutrition are independently associated with protection from cerebral malaria in Thailand", starting page 5, DOI 10.1179/000349802125000448, author Fred Flintstone, affiliation Oxford University, and the abstract text ("Although human infection with Ascaris appears to be associated with protection from cerebral malaria, there are many socio-economic and nutritional confounders related to helminth...").]
Modelling – the messy bits
Compromises:
1. Audit trail – minimising bloat
2. Multiple languages – minimising bloat
3. Generous adoption of identifiers

- That's all very nice, but... there were actually compromises, for various reasons.
- Here are 3 examples to give a flavour.
Modelling – the messy bits
1. Audit Trail and Minimising Bloat

[Diagram: /articles/1 with its title, starting page 5, DOI 10.1179/000349802125000448, abstract, authors /authors/100 and /authors/150, and /parts/1 – plus audit data: created 2005-12-07T16:24:19, modified 2005-12-07T16:24:20 by /updater/2.0.]

The elegant alternative – reify the statement /articles/1 prism:startingPage 5:
_x rdf:type rdf:Statement .
_x rdf:subject /articles/1 .
_x rdf:predicate prism:startingPage .
_x rdf:object 5 .
_x dcterms:created 2005-12-07T16:24:19 .
_x dcterms:modified 2005-12-07T16:24:20 .

- What do I mean by an audit trail? Whenever we update, say who did it and when.
- Elegant approach: reification of all statements.
- But we're already at 65Gb... 8 extra triples for each statement -> over 500Gb!
- So a compromise: audit data per significant resource.
- Not ideal, as it isn't the elegant model and we don't have all the data.
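For illustration only, a sketch of the compromise: rather than reifying every statement, hang the created/modified audit data off the significant resource itself. The updatedBy property URI and the method shape are assumptions for this sketch, not our actual loader code.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class AuditStamp {
    static final String DCTERMS = "http://purl.org/dc/terms/";
    // Hypothetical property for recording which process performed the update.
    static final String UPDATED_BY = "http://metastore.ingenta.com/schema/updatedBy";

    /** Attach audit data to the significant resource itself, not to reified statements. */
    public static void stamp(Model model, Resource article,
                             String created, String modified, String updaterUri) {
        Property createdProp = model.createProperty(DCTERMS, "created");
        Property modifiedProp = model.createProperty(DCTERMS, "modified");
        Property updatedBy = model.createProperty(UPDATED_BY);

        // Keep only the latest modification details.
        article.removeAll(modifiedProp);
        article.removeAll(updatedBy);
        if (!article.hasProperty(createdProp)) {
            article.addProperty(createdProp, created);
        }
        article.addProperty(modifiedProp, modified);
        article.addProperty(updatedBy, model.createResource(updaterUri));
    }
}
```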
Modelling – the messy bits
2. Multiple Language Abstracts – minimising bloat

- Another compromise due to scale: dealing with multiple languages for articles.
- Example article with a single-language abstract – describe the modelling.
- Example with a multi-language abstract – describe the modelling with xml:lang.
- BUT abstracts typically have XHTML markup, e.g. bold tags, so we need parseType="Literal" along with the xml:lang.
- BUT the RDF spec does not allow both together, so this solution is invalid.
- Another elegant theoretical solution: create a bnode for every language that the article has, and attach the language and the data as properties of the bnode.
- All fine, BUT 99% of abstracts are in a single language, English. So it will look like this... and since it's only English, we can remove the dc:language property... so this...
- That's still 4 extra triples per article, i.e. 16 million extra triples, which will affect query performance – unnecessary bloat – so NOT VIABLE.
- So the compromise solution is to attach an xhtml:span to every dc:description, one per language.
- COSTS: we cannot query the store for multiple languages, and querying apps have to do XML parsing to get out the multi-language abstracts.
- But it cannot be helped: given how rarely multi-language abstracts occur, the bloat and its effect on query performance are not justified.
Single-language abstract:
<struct:Article rdf:about="/articles/3279690">
  <dc:description>
    This article examines the impact of Ludwig van Beethoven's work...
  </dc:description>
</struct:Article>

Multi-language abstracts with xml:lang:
<struct:Article rdf:about="/articles/3279690">
  <dc:description xml:lang="en">
    This article examines the impact of Ludwig van Beethoven's work...
  </dc:description>
  <dc:description xml:lang="de">
    Dieser Artikel untersucht den Einfluss von Ludwig van Beethovens Werk...
  </dc:description>
</struct:Article>

But abstracts carry XHTML markup, so we also need parseType:
<struct:Article rdf:about="/articles/3279690">
  <dc:description xml:lang="en" rdf:parseType="Literal">
    This article examines the impact of <b>Ludwig van Beethoven's</b> work...
  </dc:description>
  <dc:description xml:lang="de">
    Dieser Artikel untersucht den Einfluss von Ludwig van Beethovens Werk...
  </dc:description>
</struct:Article>
1. Invalid – xml:lang and parseType together:
<dc:description xml:lang="en" rdf:parseType="Literal">

2. The bnode approach:
<struct:Article rdf:about="/articles/3279690">
  <?:abstract>
    <?:Abstract>
      <dc:language>en</dc:language>
      <dc:description rdf:parseType="Literal">This article..</dc:description>
    </?:Abstract>
  </?:abstract>
  <?:abstract>
    <?:Abstract>
      <dc:language>de</dc:language>
      <dc:description rdf:parseType="Literal">Dieser Artikel..</dc:description>
    </?:Abstract>
  </?:abstract>
</struct:Article>

Even for the common English-only case: 4 x 4.3 million articles -> 16 million extra triples

3. The compromise – one literal, with an xhtml:span per language:
<struct:Article rdf:about="/articles/3279690">
  <dc:description rdf:parseType="Literal">
    <xhtml:span xml:lang="en">This article..</xhtml:span>
    <xhtml:span xml:lang="de">Dieser Artikel..</xhtml:span>
  </dc:description>
</struct:Article>
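This is the sort of thing a querying application has to do under the compromise: parse the XML literal and pick out the span for the language it wants. A rough sketch only, not our actual client code; the wrapping root element and class name are just for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AbstractExtractor {

    /** Pull the abstract for one language out of the combined dc:description XML literal. */
    public static String abstractFor(String descriptionLiteral, String lang) throws Exception {
        // Wrap the literal so it has a single root element before parsing.
        String xml = "<root xmlns:xhtml=\"http://www.w3.org/1999/xhtml\">"
                + descriptionLiteral + "</root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Walk the xhtml:span elements and match on xml:lang.
        NodeList spans = doc.getElementsByTagName("xhtml:span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if (lang.equals(span.getAttribute("xml:lang"))) {
                return span.getTextContent();
            }
        }
        return null; // no abstract in that language
    }
}
```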
Modelling – the messy bits
3. Generous adoption of identifiers

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/42">
  <ident:doi>10.1179/037178403225003582</ident:doi>
  <ident:sici>0371-7844(20031201)112:3L.141;1-</ident:sici>
  <ident:infobike rdf:resource="infobike://maney/mint/2003/0000112/0000003/art001"/>
  <linking:genlinkerRefId rdf:resource="genlinker://refid/5518"/>
</struct:Article>

- The final compromise to the model is accepting crummy data.
- We modelled an article with a nice shiny new identifier and nice bibliographic data.
- But we have to deal with integrating with legacy stores: we say to the other engineering / DB teams "now use this store", and they have primary keys.
- We modelled those by subclassing dc:identifier, but in their own namespace, to try to keep them at arm's length – e.g. linking.
- There are also industry-standard identifiers such as DOI and SICI.
- Multiple identifiers: integration is solved by being super-tolerant – but it compromises the elegant model.
Experiences - Loading
Loading... the initial plan
· Bulk transfer:
  Headers: 4.3 million SGML files
  References: 8 million rows in an RDBMS
· Intelligent linking between resources -> many queries for every resource...

- We started with loading the data from the legacy datasets into the new store.
- The legacy sets mainly involved an SGML filestore of 4.3 million headers and an RDBMS of 8 million references.
- You might think loading would just involve getting data out of the databases, converting it into RDF/XML and inserting it into the store. HOWEVER it is more than this...
- BECAUSE we want to establish effective, intelligent crosslinks between resources. For example, where an author has written two articles, we want a single Author resource and link the two articles to it.
- This naturally involves multiple queries of the store whilst creating the RDF for load. For example, while loading an article, we first check whether the article already exists, then check whether the various entities associated with it already exist (by virtue of having been created for other articles), e.g. authors, keywords etc.
- So: multiple queries per insert.
Loading... the initial plan

SELECT ?pub ?title ?issue ?article
WHERE {
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/11111111> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "2" )
  FILTER ( ?issueLiteral = "3" )
  FILTER ( ?firstPageLiteral = "4" )
}

SELECT ?auth
WHERE {
  ?auth foaf:firstName ?firstname .
  ?auth foaf:family_name ?surname .
  FILTER ( ?firstname = "Fred" )
  FILTER ( ?surname = "Flintstone" )
}

- Example SPARQL queries for finding whether an article (or an author) already exists, using literal metadata.
- So we developed our loading programs using such queries and set them going.
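To make the "multiple queries per insert" concrete, here is a sketch of how one of those existence checks might look through Jena/ARQ. The talk doesn't show the loader code itself, so the class and method names are illustrative; the query follows the author query above, and the article check works the same way with the first query.

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Resource;

public class ArticleLoader {

    /** Look for an existing author with this name; return it, or null if the loader must create one. */
    public static Resource findAuthor(Model model, String first, String family) {
        // Same shape as the author query on the slide, parameterised for the loader.
        String query =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "SELECT ?auth WHERE { " +
            "  ?auth foaf:firstName ?firstname . " +
            "  ?auth foaf:family_name ?surname . " +
            "  FILTER ( ?firstname = \"" + first + "\" ) " +
            "  FILTER ( ?surname = \"" + family + "\" ) }";

        QueryExecution qe = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = qe.execSelect();
            return results.hasNext()
                    ? results.nextSolution().getResource("auth")
                    : null;
        } finally {
            qe.close();
        }
    }
}
```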
Loading – a big problem
· 3 months
· Bottlenecks: insertion, SPARQL queries
· Identifier retrieval is fast

* So, we wrote the loading programs, tested them on small data sets – all nice.
* But as the store grew, we started to be concerned about performance.
* A calculation on the back of an envelope said about 3 months... a problem... so we looked for ways to optimise.
* The bottleneck was at SPARQL queries like the one Priya showed you.
* Also, looking at our logs, we noticed a solution: identifier queries are fast.
* By an identifier query I mean a lookup via the dc:identifier of a resource – give me the subject whose object is this URI.
Loading – the messy workaround

<struct:Article>
  <dc:identifier rdf:resource="http://metastore.ingenta.com/content/11111111/v2n3/p4"/>
</struct:Article>

String id = "http://metastore.ingenta.com/content/11111111/v2n3/p4";
model.listSubjectsWithProperty(DC_11.identifier, model.getResource(id));

- But the SPARQL checks are costly! So:
- 1. We added predictable identifiers like this – predictable because they are made up of the literals used in the query.
- 2. And we made the loader use these where possible (falling back where the result count is not exactly 1).
* You are looking at this saying "yuk". It breaks all database modelling rules: we have to update the identifier when we update the record, and it also bloats the store – something we're strictly avoiding.
- But we didn't think of a better option. A compromise, due to scale.
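Roughly how the loader uses them: try the cheap identifier lookup first, and only fall back to the literal-based SPARQL query when it gets anything other than exactly one hit. The URI pattern copies the example above; the class name and the fallback method are only illustrative, not our actual code.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ResIterator;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC_11;

public class ArticleLookup {

    /** Try the predictable-identifier lookup; fall back to the literal SPARQL query. */
    public static Resource findArticle(Model model, String issn, String vol,
                                       String issue, String page) {
        // Predictable identifier built from the same literals the SPARQL query filters on.
        String id = "http://metastore.ingenta.com/content/" + issn
                  + "/v" + vol + "n" + issue + "/p" + page;
        ResIterator it = model.listSubjectsWithProperty(DC_11.identifier, model.getResource(id));
        Resource match = it.hasNext() ? it.nextResource() : null;
        if (match != null && !it.hasNext()) {
            return match;                                        // exactly one hit: trust it
        }
        return findArticleBySparql(model, issn, vol, issue, page); // zero or >1 hits: fall back
    }

    private static Resource findArticleBySparql(Model model, String issn, String vol,
                                                String issue, String page) {
        // ... the literal-based SPARQL query from the earlier slide would go here ...
        return null;
    }
}
```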
Experiences - Querying
How do we query the store?
· SPARQL vs RDQL
· SPARQL vs SQL

- Some lessons learnt while developing queries for applications.
- We started with RDQL, but quickly found that we needed a richer query language while re-developing the client applications, so we switched to SPARQL.
- So we would advise people to start with SPARQL.
Redeveloping queries

SELECT ?article ?date ?hostingDesc
WHERE {
  ?article rdf:type struct:Article .
  ?article struct:status <http://metastore.ingenta.com/content/status/bare> .
  ?article dcterms:created ?date .
  OPTIONAL {
    ?hostingDesc rdf:type linking:HostingDescription .
    ?hostingDesc linking:hostedArticle ?article .
    ?hostingDesc linking:linkingPartner <http://www.crossref.org>
  }
  FILTER ( ?date > "2004-10-06T00:00:00.109+01:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> )
  FILTER ( ! bound(?hostingDesc) )
}

- An example SPARQL query that we have developed, for finding the list of articles we should poll CrossRef for: we need the articles that we do not host ourselves and that we have not already polled before.
- This kind of query richness was afforded by SPARQL, and resulted in the query above.
SQL Equivalent

SELECT refs.ref_id
FROM sources, refs
LEFT OUTER JOIN matches ON refs.ref_id = matches.ref_id
WHERE sources.ref_id = refs.ref_id
  AND tags_doi IS NULL
  AND date_loaded > 20041006000000;

- This is the equivalent SQL query that was used by the client app when querying similar data from an RDBMS.
- If you contrast the two, you will find that the SPARQL query is considerably more verbose and complex than the equivalent SQL.
- So we concluded that SPARQL provided us with the richness of language we needed while developing queries, but it can be quite a difficult language to get to grips with initially, with quite a steep learning curve.
But what about IngentaConnect?
· We are supposed to provide a webservice to the front-end team.
· The good news: they want to do reasonably fixed queries.
· The bad news: they want 4 per second... a SPARQL service??

* The last results to present today.
* We had modelled, and loaded; now we come to using the store.
* The first client is the IngentaConnect website team.
* We aimed for 4 queries per second.
* We had informally noted differences in performance -> formalise with a formal test suite.
Query Performance Testing

SPARQL QUERY
SELECT ?pub ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Title .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/11111111> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "2" )
  FILTER ( ?issueLiteral = "3" )
  FILTER ( ?firstPageLiteral = "4" )
}

IDENTIFIER QUERY
String id = "http://metastore.ingenta.com/content/articles/42";
Resource ires = model.getResource(id);

- So, 2 queries: the SPARQL query above, and an identifier query like this (without getting too far into Jena... model.getResource...).
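The talk doesn't show the test harness itself, so the following is only a guess at its shape: time the identifier-style lookup (forcing the statements to actually be fetched from the store) against a SPARQL SELECT over the same model. Class and method names are made up for illustration.

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.StmtIterator;

public class QueryTimer {

    /** Time the identifier-style lookup: resolve a resource and walk its statements. */
    public static long timeIdentifierQuery(Model model, String id) {
        long start = System.currentTimeMillis();
        Resource ires = model.getResource(id);
        StmtIterator stmts = ires.listProperties();   // forces the actual database work
        while (stmts.hasNext()) {
            stmts.nextStatement();
        }
        return System.currentTimeMillis() - start;
    }

    /** Time a SPARQL SELECT over the same model. */
    public static long timeSparqlQuery(Model model, String sparql) {
        long start = System.currentTimeMillis();
        QueryExecution qe = QueryExecutionFactory.create(sparql, model);
        try {
            ResultSetFormatter.consume(qe.execSelect()); // pull every result row
        } finally {
            qe.close();
        }
        return System.currentTimeMillis() - start;
    }
}
```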
Query Performance Testing

Test conditions:
· Jena 2.3
· PostgreSQL 7
· Debian 3.1
· Intel(R) Xeon(TM) CPU 3.20GHz
· 6 SCSI drives in RAID5 – Ultra320 (15,000 rpm)
· 4G RAM

Results: DCIDENT: 23ms (/150m); SPARQL: 1.4s (/150m)

- Here are the results of the performance testing.
- These are the test conditions – we won't go through them.
- [Chart: the red line is the dc:identifier query results, the blue line the SPARQL results, plotted against store size.]
- The identifier-based query is not much affected by the size of the store and continues to perform well; SPARQL degrades as the store size increases.
Query Performance Testing
· Identifier queries – 23ms
· SPARQL – 1.4 secs
· IngentaConnect – identifier queries
· Flexible development – SPARQL

- Some lessons we learnt:
- Identifier queries perform very well.
- SPARQL queries are not as fast, but still perform within acceptable limits.
- For real-time apps where query performance is critical, e.g. webservices and IngentaConnect, use identifier queries where possible.
- For apps where you need a richer query language than identifiers can afford, or where you can live with slower response times, e.g. batch-processing apps, SPARQL can be used – it is still workable.
Conclusions
· Flexibility of RDF / RDFS helped us with an integration problem.
· An RDBMS backend is good.
· We supplemented FOAF, DC and PRISM with custom vocabs.
· Loading is really all about querying – if you want to do intelligent linking – which you do!
· Predictable identifiers – though nasty – improve query performance.
· SPARQL is handier than RDQL. But it is quite hard!
· Jena scaled to 200m triples. SPARQL performance OK.
· Compromises are unavoidable in modelling – e.g. weigh benefit against bloat.
The End
Big Fat TripleStore