Building and Managing a Massive Triplestore: an experience report


Priya Parvatikar & Katie Portwin, Ingenta
"Big Fat TripleStore"

Purpose of this talk
1. Our project – why did we get into RDF anyway?
2. Scale – whoa, it's a bit big
3. Compromises due to scale:
   · Modelling
   · Loading
   · Querying (incl. performance testing)

Assumes: knowledge of RDF

Our Project

Why did we sign up for all this work?
· What is Ingenta?
· What is the metadata?
· The database seems to be working – why are we replacing it?

- Obligatory background: how did we end up building an RDF store? First, the metadata. Ingenta hosts scholarly journal articles – e.g. the journal BMJ, or tech reports from the OECD – and we hold the metadata about those articles. Researchers access them via the IngentaConnect website: 2 million sessions a month. Quick demo.

Why did we sign up for all this work?
· What is Ingenta?
· What is the metadata?
· The database seems to be working – why are we replacing it?

- The database is more or less working, so what is the problem?
1. Multiple systems – a synching problem. Data is duplicated in various forms across several databases, and we have to pull it all together to generate a page. We want to consolidate it in a central store.
2. Things aren't modelled as we'd like them to be. The journal model no longer fits: we now host more types of content – books, supplementary data, virtual journals – and we want confidence in the future.
3. Publishers want new wacky linking.

1. Multiple systems – a synching problem

[Diagram: data flows from Publishers and Other Aggregators into a number of silos – 4.3 million Article Headers (live system), Preprints, 8 million References, 20 million External Holdings, and a Publishers/Titles database – all feeding the IngentaConnect website.]

- As you can see from the diagram, we have a number of data silos holding subsets of the data, with duplication between them. Some are filestores, others are RDBMSes: the article headers are stored on the filesystem in SGML format, the references in a MySQL database, and the holdings data in yet another database. On top of these, a variety of applications serve a number of clients by querying a variety of databases, the primary client being the IngentaConnect website. The result is maintenance and synchronisation problems. MetaStore has therefore been designed as a CENTRALISED data store to ease these problems.

2. Things aren't modelled as we'd like them to be

[Diagram: the live-system Article Headers and the 8 million References hold the same works as disconnected records – nodes A, A, X, B, Y, Z, Z, Z duplicated across silos rather than linked as single resources.]

Why did we choose RDF?
· Merge existing stores – flexibility
· Accommodate future wacky data
· Add more relationships – cross-linking
· Take the pain out of distribution – standard vocabularies

- We decided to go with RDF as the storage technology because we need to merge data from various legacy databases, so we need a flexible model – and RDF is inherently flexible.
- The store needs to be future-proof, able to cater to new requirements easily; again, the flexibility of RDF helps.
- We want the ability to add new relationships between data – adding relationships without changing the entities – producing a richer dataset.
- We have distribution agreements with third parties, which traditionally involve a pile of munging scripts. We're tired of that! Adopting RDF means using industry-standard vocabularies, easing the distribution process.

Why did we choose Jena?
· Java
· Scalability, scalability, scalability
· RDBMS backend

- We are using Jena to persist our triplestore.
- Java: in-house expertise, and a commercial consideration – it rules out Redland etc.
- Scalability: initial testing at 50m triples. Kowari had memory problems, Sesame had performance problems; we had "confidence" that Jena would scale.
- RDBMS backend: maintenance – backup, replication, and handover to the sysadmins. It also lets us be nosey: peeking inside isn't really our business, but a familiar interface helps during debugging.
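(Not from the slides: a minimal sketch of how a Jena 2.x model persisted in PostgreSQL is typically opened. The JDBC URL, database name, and credentials are placeholders, not Ingenta's actual configuration.)

import com.hp.hpl.jena.db.DBConnection;
import com.hp.hpl.jena.db.IDBConnection;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.ModelMaker;

public class OpenStore {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");  // load the JDBC driver
        // JDBC connection to the PostgreSQL database backing the store
        IDBConnection conn = new DBConnection(
                "jdbc:postgresql://localhost/metastore",  // placeholder URL
                "user", "password", "PostgreSQL");
        ModelMaker maker = ModelFactory.createModelRDBMaker(conn);
        // Opens the named persistent model, creating it if it doesn't exist
        Model model = maker.openModel("metastore");
        System.out.println("Store size: " + model.size() + " statements");
        model.close();
        conn.close();
    }
}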

[Screenshot: a PostgreSQL client session against the store. A simple query selects the first three triples – subject, property, object columns; URIs and literals are stored unencrypted, so the data is directly readable.]
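(The same peek can be done through the Jena API rather than psql; a sketch, assuming the model opened in the previous snippet.)

import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;

// Print the first three triples: subject, property, object
StmtIterator it = model.listStatements();
for (int i = 0; i < 3 && it.hasNext(); i++) {
    Statement s = it.nextStatement();
    System.out.println(s.getSubject() + " " + s.getPredicate() + " " + s.getObject());
}
it.close();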

Scale

What is the scale of the project?
· 4.3 million articles and references
· 200 million triples
· 65Gb on disk
· 100Gb ?
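(A quick sanity check on those figures, not from the slides: 65Gb over 200 million triples works out to roughly 325 bytes per stored triple, and 200 million triples over 4.3 million articles is roughly 46 triples per article – useful rules of thumb when weighing any modelling change that adds triples per article.)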

Experiences - Modelling

Modelling... the initial plan
· What do we mean, 'modelling'?
· Standard vocabularies: PRISM, DC, FOAF
· Custom vocabularies
· An example ->

- Modelling: developing schemas and vocabularies to represent the various entities and the relationships between them.
- We tried to use standard vocabularies as far as possible, e.g. DC, PRISM, FOAF.
- We had to create custom vocabularies where the standard ones did not appear to suffice.
- Example follows.

[Diagram: the example article model. Nodes: /pubs/200, /pubs/10, /titles/100, /parts/1, /articles/1, and /authors/150 (a foaf:Person, linked from the article via struct:Author). Literals on /articles/1 include the starting page (5), the title ("Intestinal helminths and malnutrition are independently associated with protection from cerebral malaria in Thailand"), the DOI (10.1179/000349802125000448), and the abstract ("Although human infection with Ascaris appears to be associated with protection from cerebral malaria, there are many socio-economic and nutritional confounders related to helminth......."); /authors/150 carries the name Fred Flintstone and the affiliation Oxford University.]
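(For concreteness – not from the talk itself – a minimal Jena sketch of building that example. The struct: namespace URI is a placeholder since the slides never show the real one, and expanding the relative /articles/1 URIs against metastore.ingenta.com is an assumption based on the full URIs shown later.)

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC_11;

Model m = ModelFactory.createDefaultModel();
// Placeholder namespaces – the talk does not give the real struct: URI
String STRUCT = "http://metastore.ingenta.com/ns/structure#";
String PRISM  = "http://prismstandard.org/namespaces/1.2/basic/";
String FOAF   = "http://xmlns.com/foaf/0.1/";

Resource fred = m.createResource("http://metastore.ingenta.com/content/authors/150")
    .addProperty(m.createProperty(FOAF, "name"), "Fred Flintstone");

m.createResource("http://metastore.ingenta.com/content/articles/1")
    .addProperty(DC_11.title, "Intestinal helminths and malnutrition are independently "
        + "associated with protection from cerebral malaria in Thailand")
    .addProperty(m.createProperty(PRISM, "startingPage"), "5")
    .addProperty(m.createProperty(PRISM, "isPartOf"),
                 m.createResource("http://metastore.ingenta.com/content/parts/1"))
    .addProperty(m.createProperty(STRUCT, "Author"), fred);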

Modelling – the messy bits
Compromises:
1. Audit trail – minimising bloat
2. Multiple languages – minimising bloat
3. Generous adoption of identifiers

- That's all very nice, but... in practice we made compromises, for various reasons. Here are three examples to give a flavour.

Modelling – the messy bits
1. Audit trail and minimising bloat

The elegant model: reify every statement and timestamp it –

_x rdf:type rdf:Statement .
_x rdf:subject /articles/1 .
_x rdf:predicate prism:startingPage .
_x rdf:object 5 .
_x dcterms:created 2005-12-07T16:24:19 .
_x dcterms:modified 2005-12-07T16:24:20 .

- What do we mean by an audit trail? Whenever we update, we record who (e.g. /updater/2.0) and when.
- The elegant approach is reification of all statements. But the store is already 65Gb, and reification adds 8 extra triples for each statement – over 500Gb!
- So, a compromise: one audit trail per significant resource (e.g. dcterms:created and dcterms:modified attached directly to /articles/1). Not ideal: it isn't the elegant model, and we don't keep all the data.
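(A sketch, not from the slides, of the per-resource compromise through the Jena API – two triples per resource instead of eight per statement. The model `m` is carried over from the earlier sketch, and the plain-string timestamps mirror the slide.)

import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

String DCTERMS = "http://purl.org/dc/terms/";
Property created  = m.createProperty(DCTERMS, "created");
Property modified = m.createProperty(DCTERMS, "modified");

Resource article = m.getResource("http://metastore.ingenta.com/content/articles/1");
// Stamp the significant resource once on creation...
if (!article.hasProperty(created)) {
    article.addProperty(created, "2005-12-07T16:24:19");
}
// ...and replace the modification time on every update
article.removeAll(modified);
article.addProperty(modified, "2005-12-07T16:24:20");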

Modelling – the messy bits
2. Multiple-language abstracts – minimising bloat

- Another compromise due to scale: dealing with multiple languages for articles.
- Start with an example article with a single-language abstract, then an example with a multi-language abstract, modelled with xml:lang.
- BUT abstracts typically contain XHTML markup, e.g. bold tags, so we also need rdf:parseType="Literal" alongside the xml:lang. The RDF spec does not allow both together, so that solution is invalid.
- Another elegant, theoretical solution: create a bnode for every language the article has, and attach the language and the text as properties of the bnode. All fine, BUT 99% of abstracts are in a single language, English. And since they are English-only, we could even drop the dc:language property. Even so, that is 4 extra triples per article, i.e. 16 million extra triples, which would affect query performance. Unnecessary bloat – NOT VIABLE.
- The compromise: attach an xhtml:span to every dc:description, one per language. Costs: we cannot query the store by language, and querying apps have to do XML parsing to extract the multi-language abstracts. But it can't be helped: in this case the effect on query performance does not justify the bloat, given the probability of occurrence.

A single-language abstract:

<struct:Article rdf:about="/articles/3279690">
  <dc:description>
    This article examines the impact of Ludwig van Beethoven's work...
  </dc:description>
</struct:Article>

A multi-language abstract, using xml:lang:

<struct:Article rdf:about="/articles/3279690">
  <dc:description xml:lang="en">
    This article examines the impact of Ludwig van Beethoven's work...
  </dc:description>
  <dc:description xml:lang="de">
    Dieser Artikel untersucht den Einfluss von Ludwig van Beethovens Werk...
  </dc:description>
</struct:Article>

But abstracts carry XHTML markup, which needs rdf:parseType="Literal" – and the spec does not allow it together with xml:lang:

<struct:Article rdf:about="/articles/3279690">
  <dc:description xml:lang="en" rdf:parseType="Literal">
    This article examines the impact of <b>Ludwig van Beethoven's</b> work...
  </dc:description>
</struct:Article>

1. Invalid: <dc:description xml:lang="en" rdf:parseType="Literal">

2. The bnode-per-language model:

<struct:Article rdf:about="/articles/3279690">
  <?:abstract>
    <?:Abstract>
      <dc:language>en</dc:language>
      <dc:description rdf:parseType="Literal">This article..</dc:description>
    </?:Abstract>
  </?:abstract>
  <?:abstract>
    <?:Abstract>
      <dc:language>de</dc:language>
      <dc:description rdf:parseType="Literal">Dieser Artikel..</dc:description>
    </?:Abstract>
  </?:abstract>
</struct:Article>

4 extra triples X 4.3 million articles -> 16 million extra triples

3. The compromise – an xhtml:span per language inside a single XML literal:

<struct:Article rdf:about="/articles/3279690">
  <dc:description rdf:parseType="Literal">
    <xhtml:span xml:lang="en">This article..</xhtml:span>
    <xhtml:span xml:lang="de">Dieser Artikel..</xhtml:span>
  </dc:description>
</struct:Article>
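(Again a sketch rather than Ingenta's actual code: the compromise can be built in Jena as a single well-formed XML literal – createLiteral's boolean flag marks the value as an rdf:XMLLiteral. The model `m` is from the earlier sketches.)

import com.hp.hpl.jena.rdf.model.Literal;
import com.hp.hpl.jena.vocabulary.DC_11;

String spans =
    "<xhtml:span xmlns:xhtml=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">This article..</xhtml:span>"
  + "<xhtml:span xmlns:xhtml=\"http://www.w3.org/1999/xhtml\" xml:lang=\"de\">Dieser Artikel..</xhtml:span>";
// 'true' = the lexical form is well-formed XML, i.e. an rdf:XMLLiteral
Literal abstractLiteral = m.createLiteral(spans, true);
m.getResource("http://metastore.ingenta.com/content/articles/3279690")
 .addProperty(DC_11.description, abstractLiteral);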

Modelling – the messy bits
3. Generous adoption of identifiers

<struct:Article rdf:about="http://metastore.ingenta.com/content/articles/42">
  <ident:doi>10.1179/037178403225003582</ident:doi>
  <ident:sici>0371-7844(20031201)112:3L.141;1-</ident:sici>
  <ident:infobike rdf:resource="infobike://maney/mint/2003/0000112/0000003/art001"/>
  <linking:genlinkerRefId rdf:resource="genlinker://refid/5518"/>
</struct:Article>

- The final compromise to the model is accepting crummy data.
- We modelled an article with a nice shiny new identifier and nice bibliographic data. But we have to integrate with the legacy stores: when we tell the other engineering and database teams "now use this store", they have primary keys.
- We modelled those by subclassing dc:identifier, but in their own namespaces, to try to keep them at arm's length – e.g. the linking identifiers – alongside industry-standard ones such as the DOI and SICI.
- Multiple identifiers and integration: solved by being super-tolerant, but it compromises the elegant model.
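("Subclassing dc:identifier" for a property presumably means rdfs:subPropertyOf; a sketch, with a placeholder ident: namespace URI since the real one is not shown.)

import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.vocabulary.DC_11;
import com.hp.hpl.jena.vocabulary.RDFS;

String IDENT = "http://metastore.ingenta.com/ns/identifier#";  // placeholder
Property doi = m.createProperty(IDENT, "doi");
// Keep the legacy/industry key in its own namespace,
// but declare it a kind of dc:identifier
m.add(doi, RDFS.subPropertyOf, DC_11.identifier);
m.getResource("http://metastore.ingenta.com/content/articles/42")
 .addProperty(doi, "10.1179/037178403225003582");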

Experiences - Loading

Loading... the initial plan
· Bulk transfer: Headers: 4.3 million SGML files; References: an 8-million-row RDBMS
· Intelligent linking between resources -> many queries for every resource...

- We started by loading the data from the legacy datasets into the new store. The legacy sets mainly comprise an SGML filestore of 4.3 million headers and an RDBMS of 8 million references.
- You might think loading just means getting data out of the databases, converting it into RDF/XML, and inserting it into the store. HOWEVER, it is more than this, BECAUSE we want to establish effective, intelligent crosslinks between resources. For example, where an author has written two articles, we want a single Author resource with both articles linked to it.
- This naturally means multiple queries of the store while creating the RDF for load. When loading an article, first check whether the article already exists; then check whether the various entities associated with it (authors, keywords, etc.) already exist, by virtue of having been created for other articles. So: multiple queries per insert.

Loading... the initial plan

SELECT ?pub ?title ?issue ?article
WHERE {
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/11111111> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "2" )
  FILTER ( ?issueLiteral = "3" )
  FILTER ( ?firstPageLiteral = "4" )
}

SELECT ?auth
WHERE {
  ?auth foaf:firstName ?firstname .
  ?auth foaf:family_name ?surname .
  FILTER ( ?firstname = "Fred" )
  FILTER ( ?surname = "Flintstone" )
}

- Example SPARQL queries for checking whether an article or author already exists, using literal metadata. We developed our loading programs around queries like these and set them going.
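(A sketch of how a loader might run such an existence check through Jena's ARQ API – the FILTERs are folded into the triple patterns for brevity, and the actual loader code is not shown in the talk.)

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;

String q =
    "PREFIX dc: <http://purl.org/dc/elements/1.1/> "
  + "PREFIX prism: <http://prismstandard.org/namespaces/1.2/basic/> "
  + "SELECT ?article WHERE { "
  + "  ?title dc:identifier <http://metastore.ingenta.com/content/issn/11111111> . "
  + "  ?issue prism:isPartOf ?title . ?issue prism:volume \"2\" . ?issue prism:number \"3\" . "
  + "  ?article prism:isPartOf ?issue . ?article prism:startingPage \"4\" }";
Query query = QueryFactory.create(q);
QueryExecution qexec = QueryExecutionFactory.create(query, model);
try {
    boolean exists = qexec.execSelect().hasNext();  // article already in the store?
} finally {
    qexec.close();
}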

Loading – a big problem
· 3 months
· Bottleneck: SPARQL queries on insertion
· Identifier retrieval is fast

- So: we wrote the loading programs, tested them on small datasets, all nice. But as the store grew, we became concerned about performance. A back-of-the-envelope calculation said the load would take about 3 months. A problem... so we looked for ways to optimise.
- The bottleneck was SPARQL queries like the one Priya showed you. Looking at our logs, we also noticed a way out: identifier queries are fast. By "identifier query" I mean a lookup via the dc:identifier of a resource – give me the subject whose object is this URI.

Loading – the messy workaround

<struct:Article>
  <dc:identifier rdf:resource="http://metastore.ingenta.com/content/11111111/v2n3/p4"/>
</struct:Article>

String id = "http://metastore.ingenta.com/content/11111111/v2n3/p4";
ResIterator subjects =
    model.listSubjectsWithProperty(DC_11.identifier, model.getResource(id));

- But SPARQL is costly! So:
1. We added predictable identifiers like this – predictable because they are made up of the literals that would otherwise be used in the query.
2. And we made the loader use them wherever possible (falling back to SPARQL where the lookup returns anything other than exactly one result).
- You are looking at this saying "yuk": it breaks all the database modelling rules. We have to update the identifier whenever the underlying literals are updated, and it bloats the store – something we are otherwise strictly avoiding. But we couldn't think of a better option. A compromise, due to scale.
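(A sketch of the lookup-with-fallback pattern just described; the findBySparql helper stands in for the slower query path and is hypothetical.)

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ResIterator;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC_11;

Resource findArticle(Model model, String issn, String vol, String num, String page) {
    // Predictable identifier assembled from the literals we would otherwise query on
    String id = "http://metastore.ingenta.com/content/" + issn
              + "/v" + vol + "n" + num + "/p" + page;
    ResIterator subjects =
        model.listSubjectsWithProperty(DC_11.identifier, model.getResource(id));
    Resource hit = subjects.hasNext() ? subjects.nextResource() : null;
    if (hit != null && !subjects.hasNext()) {
        return hit;                                  // exactly one match: fast path
    }
    return findBySparql(issn, vol, num, page);       // <>1 result: fall back to SPARQL
}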

Experiences - Querying

How do we query the store?
· SPARQL vs RDQL
· SPARQL vs SQL
· Some lessons learnt while developing queries for applications

- We started with RDQL, but quickly found that we needed a richer query language while redeveloping the client applications, so we switched to SPARQL. We would advise people to start with SPARQL.

Redeveloping queries

SELECT ?article ?date ?hostingDesc
WHERE {
  ?article rdf:type struct:Article .
  ?article struct:status <http://metastore.ingenta.com/content/status/bare> .
  ?article dcterms:created ?date .
  OPTIONAL {
    ?hostingDesc rdf:type linking:HostingDescription .
    ?hostingDesc linking:hostedArticle ?article .
    ?hostingDesc linking:linkingPartner <http://www.crossref.org>
  }
  FILTER ( ?date > "2004-10-06T00:00:00.109+01:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> )
  FILTER ( ! bound(?hostingDesc) )
}

- An example SPARQL query we developed, for finding the list of articles we should poll CrossRef for: we need the articles that we do not host ourselves and that we have not already polled. SPARQL affords this kind of query richness, resulting in the query above.

SQL Equivalent

SELECT refs.ref_id
FROM sources, refs
LEFT OUTER JOIN matches ON refs.ref_id = matches.ref_id
WHERE sources.ref_id = refs.ref_id
  AND tags_doi IS NULL
  AND date_loaded > 20041006000000;

- This is the equivalent SQL used by the client app when querying similar data from an RDBMS. Contrasting the two, the SPARQL query is considerably more verbose and complex than the SQL. Still, we concluded that SPARQL provided the richness of language we needed when developing queries – though it can be quite a difficult language to get to grips with initially, with quite a steep learning curve.

But what about IngentaConnect?
· We are supposed to provide a webservice to the front-end team.
· The good news: they want to do reasonably fixed queries.
· The bad news: they want 4 per second... a SPARQL service??

- The last results to present today. We had modelled and loaded; now we come to using the store. The first client is the IngentaConnect website team. We aimed for 4 queries per second. We had informally noted differences in performance between query styles, so we formalised this with a proper test suite.

Query Performance Testing

SPARQL QUERY:

SELECT ?pub ?title ?issue ?article
WHERE {
  ?title rdf:type struct:Title .
  ?title dc:identifier <http://metastore.ingenta.com/content/issn/11111111> .
  ?issue prism:isPartOf ?title .
  ?issue prism:volume ?volumeLiteral .
  ?issue prism:number ?issueLiteral .
  ?article prism:isPartOf ?issue .
  ?article prism:startingPage ?firstPageLiteral .
  FILTER ( ?volumeLiteral = "2" )
  FILTER ( ?issueLiteral = "3" )
  FILTER ( ?firstPageLiteral = "4" )
}

IDENTIFIER QUERY:

String id = "http://metastore.ingenta.com/content/articles/42";
Resource ires = model.getResource(id);

- So, two queries: the SPARQL query used during loading, and an identifier query like this (without getting into the Jena details – model.getResource and a property lookup).
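(The talk doesn't show the harness itself; a minimal sketch of how such a timing comparison might look – iteration count arbitrary, and sparqlQuery is the query string above.)

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.StmtIterator;

long t0 = System.currentTimeMillis();
for (int i = 0; i < 100; i++) {
    // Identifier path: fetch the resource and touch its properties
    Resource r = model.getResource("http://metastore.ingenta.com/content/articles/42");
    StmtIterator props = r.listProperties();
    while (props.hasNext()) props.nextStatement();
}
long identMs = (System.currentTimeMillis() - t0) / 100;

t0 = System.currentTimeMillis();
for (int i = 0; i < 100; i++) {
    QueryExecution qe = QueryExecutionFactory.create(sparqlQuery, model);
    try { qe.execSelect().hasNext(); } finally { qe.close(); }
}
long sparqlMs = (System.currentTimeMillis() - t0) / 100;
System.out.println("identifier: " + identMs + "ms, SPARQL: " + sparqlMs + "ms");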

Query Performance Testing

Test conditions: Jena 2.3, PostgreSQL 7, Debian 3.1, Intel(R) Xeon(TM) CPU 3.20GHz, 6 SCSI drives in RAID5 - Ultra320 (15,000 rpm), 4G RAM.

[Chart: query time vs store size. The red line (dc:identifier queries) stays flat at ~23ms even at 150m triples; the blue line (SPARQL) degrades as the store grows, reaching ~1.4s at 150m triples.]

- The identifier-based query is not much affected by the size of the store and continues to perform well; SPARQL degrades as the store size increases.

Query Performance Testing
· Identifier queries – 23ms
· SPARQL – 1.4 secs
· IngentaConnect – identifier queries
· Flexible development – SPARQL

- Some lessons we learnt. Identifier queries perform very well. SPARQL queries are not as fast, but still perform within acceptable limits.
- For real-time apps where query performance is critical, e.g. webservices and IngentaConnect, use identifier queries where possible.
- For apps that need a richer query language than identifiers can afford, or that can live with slower response times, e.g. batch-processing apps, SPARQL is still workable.

Conclusions
· The flexibility of RDF / RDFS helped us with an integration problem.
· An RDBMS backend is good.
· We supplemented FOAF, DC and PRISM with custom vocabularies.
· Loading is really all about querying – if you want to do intelligent linking, which you do!
· Predictable identifiers – though nasty – improve query performance.
· SPARQL is handier than RDQL. But it is quite hard!
· Jena scaled to 200m triples. SPARQL performance is OK.
· Compromises are unavoidable in modelling – e.g. weigh benefit against bloat.

The End
"Big Fat TripleStore"