Scaling Jena in a commercial environment: The Ingenta MetaStore Project

JUC2006

Scaling Jena in a commercial environment
The Ingenta MetaStore Project
Purpose
● Give an example of a big, commercial app using Jena.
● Share experiences and problems

What is the metastore?
An RDF triple store which holds Ingenta's bibliographic metadata and is:
* A centralised system
* Flexible in format
* Scalable
* Distributable
* Easy to integrate

Existing Systems
[Diagram of the existing data sources; labels:]
* 4.3 million Article Headers
* 8 million References
* Publishers/Titles Database
* IngentaConnect Website
* Preprints
* 20 million External Holdings
* Publishers
* Other Aggregators
* Article Headers, live system

Architecture of new system
[Architecture diagram; components:]
* RDF Triplestore (PostgreSQL): Master, Slave
* XML API (read only) (Jena)
* Primary Loader (Jena)
* JMS Queue
* IngentaConnect
* Other Clients
* JMS Queue
* Other Systems
* Customer Data
* Other loaders / enhancers

RDFS Modelling – What was the data anyway?
Standard Vocabularies
* Dublin Core
* PRISM
* FOAF
Custom Vocabularies
* Identifiers
* Structure
* Branding
Some stats about schemas
* 28 Classes
* 72 Properties
* 4/18th from Standard Vocabs
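As a rough illustration of how standard and custom vocabularies might combine, the sketch below describes one article as subject-predicate-object triples. The names `prism:isPartOf`, `prism:startingPage` and `struct:Article` appear in the queries later in this talk; `dc:title`, the identifiers and the values are assumptions made for this example, not Ingenta's actual data.

```python
def article_triples(article_id, title, issue_id, starting_page):
    """Return (subject, predicate, object) triples describing one article.

    All identifiers and values passed in are hypothetical examples.
    """
    return [
        (article_id, "rdf:type", "struct:Article"),         # custom "structure" vocabulary
        (article_id, "dc:title", title),                    # Dublin Core
        (article_id, "prism:isPartOf", issue_id),           # PRISM
        (article_id, "prism:startingPage", starting_page),  # PRISM
    ]


triples = article_triples("article:1", "An example title", "issue:1", "539")
for triple in triples:
    print(triple)
```

Each triple mixes an identifier from the store with a property from one vocabulary, which is what lets new kinds of metadata be added without changing a table layout.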

Journal XML, with highlights

Hosting description XML, with highlights

OK, enough about your project, tell me something about Jena!
OK....
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?

How did we choose an RDF Engine?
Experimented with Java APIs
* Jena + PostgreSQL
* Sesame + PostgreSQL
* Kowari + native
Method of testing

Why did we choose Jena?
* Relational Database backend
* Usability, Support
* Easy to debug
* Schemagen
* Scalability

What problems did we have?
1. Insertion - performance
2. Ontologies – memory
3. Encapsulation – limiting flexibility
(most problems due to scale..)

The Project - Scale
* Number of triples = ~200 million and keeps growing
* Size on disk = 65 GB
* Result of loading 4.3 million articles and references
Some details of database tables
* jena_long_lit - ~4.5 million records
* jena_long_uri - ~0.14 million records
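A back-of-the-envelope calculation puts these figures in proportion. It is only a sketch: both inputs are the approximate numbers from the slide, "65 GB" is read as 65 × 10^9 bytes, and the disk figure presumably covers everything the database keeps on disk, not just raw triples.

```python
# Approximate figures taken from the slide.
triples = 200 * 10**6         # ~200 million triples
disk_bytes = 65 * 10**9       # 65 GB on disk (decimal gigabytes assumed)
articles = 4.3 * 10**6        # 4.3 million articles loaded

# Average on-disk footprint per triple, including whatever else is stored.
bytes_per_triple = disk_bytes / triples

# Average triples per loaded article (references are counted in the total).
triples_per_article = triples / articles

print(f"{bytes_per_triple:.0f} bytes per triple")
print(f"{triples_per_article:.1f} triples per article")
```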

Prob 1. Insertion performance
* Task - load backdata
* What does that actually involve? For each article:
  – Get metadata from database 1.
  – Add metadata from database 2.
  – Reform into new RDFS model
  – Query the store – look for relevant resources
  – Model.read
* Problem
* Possible Solutions?
  - Turn off index rebuild
  - Turn off duplicate checking
  - Batching

Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
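The slides do not spell out how batching was implemented, so the following is only a generic sketch of the idea: group many inserts into one database round trip and one commit instead of committing per record. Python's built-in `sqlite3` stands in for PostgreSQL, and the `triples` table and the `load_in_batches` helper are hypothetical, not Jena's schema or API.

```python
import sqlite3


def load_in_batches(conn, triples, batch_size=1000):
    """Insert triples in groups of `batch_size`, committing once per group.

    Returns the number of batches written.
    """
    cur = conn.cursor()
    batch = []
    batches = 0
    for triple in triples:
        batch.append(triple)
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
            conn.commit()
            batches += 1
            batch = []
    if batch:  # flush the final, partly filled batch
        cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
        conn.commit()
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")  # hypothetical table
data = [(f"article:{i}", "prism:startingPage", str(i)) for i in range(2500)]
n_batches = load_in_batches(conn, data, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(n_batches, row_count)
```

One cost of this pattern is that data only becomes visible when its batch commits, and a failure partway through affects the whole uncommitted batch rather than a single record.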

Prob 2. Ontologies – memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?

Prob 3. Encapsulation – limiting flexibility
* Not really a problem with Jena – an experience
* Why are we encapsulating the Jena code?
* What is the problem with that?
* Solutions?

Performance Testing SPARQL
Standard query – TITLE TYPE QUERY

    SELECT ?title ?issue ?article
    WHERE {
      ?title rdf:type struct:Journal .
      ?title dc:identifier .
      ?issue prism:isPartOf ?title .
      ?issue prism:volume ?volumeLiteral .
      ?issue prism:number ?issueLiteral .
      ?article prism:isPartOf ?issue .
      ?article prism:startingPage ?firstPageLiteral .
      FILTER ( ?volumeLiteral = "20" )
      FILTER ( ?issueLiteral = "4" )
      FILTER ( ?firstPageLiteral = "539" )
    }

Performance Testing SPARQL
NO TYPES QUERY

    SELECT ?title ?issue ?article
    WHERE {
      ?title dc:identifier .
      ?issue prism:isPartOf ?title .
      ?issue prism:volume ?volumeLiteral .
      ?issue prism:number ?issueLiteral .
      ?article prism:isPartOf ?issue .
      ?article prism:startingPage ?firstPageLiteral .
      FILTER ( ?volumeLiteral = "20" )
      FILTER ( ?issueLiteral = "4" )
      FILTER ( ?firstPageLiteral = "539" )
    }

Performance Testing SPARQL
TITLE TYPE QUERY

    SELECT ?title ?issue ?article
    WHERE {
      ?title rdf:type struct:Journal .
      ?title dc:identifier .
      ?issue prism:isPartOf ?title .
      ?issue prism:volume ?volumeLiteral .
      ?issue prism:number ?issueLiteral .
      ?article prism:isPartOf ?issue .
      ?article prism:startingPage ?firstPageLiteral .
      FILTER ( ?volumeLiteral = "20" )
      FILTER ( ?issueLiteral = "4" )
      FILTER ( ?firstPageLiteral = "539" )
    }

Performance Testing SPARQL
ALL TYPES QUERY

    SELECT ?title ?issue ?article
    WHERE {
      ?title rdf:type struct:Journal .
      ?issue rdf:type struct:Issue .
      ?article rdf:type struct:Article .
      ?title dc:identifier .
      ?issue prism:isPartOf ?title .
      ?issue prism:volume ?volumeLiteral .
      ?issue prism:number ?issueLiteral .
      ?article prism:isPartOf ?issue .
      ?article prism:startingPage ?firstPageLiteral .
      FILTER ( ?volumeLiteral = "20" )
      FILTER ( ?issueLiteral = "4" )
      FILTER ( ?firstPageLiteral = "539" )
    }
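To make the difference between the query variants concrete, here is a toy evaluator for triple patterns. It is a sketch under several assumptions: it joins patterns naively in the order given, which says nothing about how Jena actually plans queries; the FILTER conditions are folded into constant objects; the `dc:identifier` pattern is left out because its value is missing from the slides; and the data is invented.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it, or check the existing binding
            if new.setdefault(term, value) != value:
                return None
        elif term != value:       # constant: must be equal
            return None
    return new


def solve(patterns, triples):
    """Evaluate triple patterns with nested-loop joins, in the order given."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


# Invented sample data: one journal, one issue, two articles.
store = [
    ("j1", "rdf:type", "struct:Journal"),
    ("i1", "prism:isPartOf", "j1"),
    ("i1", "prism:volume", "20"),
    ("i1", "prism:number", "4"),
    ("a1", "prism:isPartOf", "i1"),
    ("a1", "prism:startingPage", "539"),
    ("a2", "prism:isPartOf", "i1"),
    ("a2", "prism:startingPage", "12"),
]

no_types = [
    ("?issue", "prism:isPartOf", "?title"),
    ("?issue", "prism:volume", "20"),
    ("?issue", "prism:number", "4"),
    ("?article", "prism:isPartOf", "?issue"),
    ("?article", "prism:startingPage", "539"),
]
title_type = [("?title", "rdf:type", "struct:Journal")] + no_types

no_types_result = solve(no_types, store)
title_type_result = solve(title_type, store)
print(title_type_result)
```

On this data both variants return the same single solution. What the type pattern changes is the starting point: the evaluation begins from the journals rather than from every `prism:isPartOf` triple.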

Performance Testing SPARQL
Title type only - <1.5 secs for 150 million triples
TEST CONDITIONS
* Jena 2.3
* PostgreSQL 7
* Debian
* Intel(R) Xeon(TM) CPU 3.20GHz
* 6 SCSI Drives
* 4 GB RAM

Where are we now with the project?
Recent Work
* Loaded 4.3 million articles through the batching process; ongoing loading in place
* Non-journal content modelled
* REST API
Current Work
* Replication
* SPARQL merging
* Phase out batching and use queues instead

Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore

The End