JUC2006: Scaling Jena in a commercial environment. The Ingenta MetaStore Project

Presentation transcript:

JUC2006 Scaling Jena in a commercial environment: The Ingenta MetaStore Project
● Purpose
  ● Give an example of a big, commercial app using Jena.
  ● Share experiences and problems.
Presented at: Jena User Conference, May 2006, Bristol, UK.
TODO:LINK TO PAPER...

What is the MetaStore?
An RDF triple store which holds Ingenta's bibliographic metadata and is:
● A centralised system
● Flexible in format
● Scalable
● Distributable
● Easy to integrate

Existing Systems
[Diagram. Labels, in the order they appear: 4.3 million Article Headers; 8 million References; Publishers/Titles Database; IngentaConnect Website; Preprints; 20 million External Holdings; Publishers; Other Aggregators; Article Headers, live system]

Architecture of new system
[Diagram. Components, in the order they appear: RDF Triplestore (PostgreSQL), Master and Slave; XML API (read only) (Jena); Primary Loader (Jena); JMS Queue; IngentaConnect; Other Clients; JMS Queue; Other Systems; Customer Data; Other loaders / enhancers]

RDFS Modelling: what was the data anyway?
Standard vocabularies:
● Dublin Core
● PRISM
● FOAF
Custom vocabularies:
● Identifiers
● Structure
● Branding
Some stats about schemas:
● 28 classes
● 72 properties
● 4/18th from standard vocabs

[Slide image: Journal XML, with highlights]

[Slide image: Hosting description XML, with highlights]

Example queries by client: IngentaConnect
All properties of a book:
  PREFIX struct:
  PREFIX dc:
  PREFIX rdf:
  SELECT ?titleid ?prop ?val
  WHERE {
    ?titleid rdf:type struct:Book .
    ?titleid dc:identifier .
    ?titleid ?prop ?val
  }
Browse by publisher:
  PREFIX rdf:
  PREFIX dc:
  PREFIX struct:
  SELECT ?pubid ?pubname
  WHERE {
    ?pubid rdf:type struct:Publisher .
    ?pubid dc:title ?pubname
  }
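
To make the shape of the "Browse by publisher" query concrete, here is a minimal Python sketch of how its two triple patterns join on ?pubid. It is not Jena code: it runs over an in-memory list of tuples, and the sample identifiers and names (pub:1, "Example Press" and so on) are made up for illustration.

```python
# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("pub:1", "rdf:type", "struct:Publisher"),
    ("pub:1", "dc:title", "Example Press"),
    ("pub:2", "rdf:type", "struct:Publisher"),
    ("pub:2", "dc:title", "Sample Books"),
    ("book:9", "rdf:type", "struct:Book"),
    ("book:9", "dc:title", "Not a publisher"),
]


def browse_by_publisher(triples):
    """Mimic: ?pubid rdf:type struct:Publisher . ?pubid dc:title ?pubname"""
    # First pattern: every subject typed as a publisher.
    publishers = {s for s, p, o in triples
                  if p == "rdf:type" and o == "struct:Publisher"}
    # Second pattern: titles, kept only where the subject matched above.
    return sorted((s, o) for s, p, o in triples
                  if p == "dc:title" and s in publishers)


RESULT = browse_by_publisher(TRIPLES)
```

The book's title is dropped because its subject never matched the rdf:type pattern, which is the same effect the shared ?pubid variable has in the SPARQL version.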

Example query by background process
  PREFIX dcterms:
  PREFIX struct:
  PREFIX dc:
  PREFIX linking:
  PREFIX rdf:
  SELECT ?article ?date ?hostingDesc
  WHERE {
    ?article rdf:type struct:Article .
    ?article struct:status .
    ?article dcterms:created ?date .
    OPTIONAL {
      ?hostingDesc rdf:type linking:hostingDescription .
      ?hostingDesc linking:hostedArticle ?article .
      ?hostingDesc linking:linkingPartner .
    }
    FILTER (! bound(?hostingDesc))
    FILTER ( ?date > \ T14:35: :00\^^ )
  }
  LIMIT 50 ;
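
The OPTIONAL block followed by FILTER (! bound(?hostingDesc)) is the usual SPARQL 1.0 idiom for negation: keep only the articles for which no matching hosting description exists. The Python sketch below shows the same logic over plain data structures. The sample data, the function name unhosted_since and the partner identifier are illustrative assumptions; the status value and linking partner are not legible in the transcript, so the sketch takes the partner as a parameter and leaves the status pattern out.

```python
from datetime import datetime

# Hypothetical sample data: article id -> creation date.
ARTICLES = {
    "art:1": datetime(2006, 3, 1, 9, 0),
    "art:2": datetime(2006, 4, 2, 12, 30),
    "art:3": datetime(2006, 1, 15, 8, 0),
}
# Hypothetical rows: (hosting description, hosted article, linking partner).
HOSTING = [("hd:1", "art:1", "partner:A")]


def unhosted_since(articles, hosting, partner, since, limit=50):
    """Articles created after `since` with no hosting description for `partner`."""
    # The OPTIONAL block: articles that do have a hosting description.
    hosted = {article for _, article, p in hosting if p == partner}
    # FILTER(!bound(...)) keeps the rest; the date FILTER is the `d > since` test.
    rows = [(a, d) for a, d in sorted(articles.items())
            if d > since and a not in hosted]
    return rows[:limit]  # LIMIT 50 in the original query


ROWS = unhosted_since(ARTICLES, HOSTING, "partner:A", datetime(2006, 2, 1))
```

With this data only art:2 is returned: art:1 is recent but already hosted, and art:3 is older than the cut-off.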

OK, enough about your project, tell me something about Jena!
OK....
● How did we choose an RDF Engine?
● Why did we choose Jena?
● What problems did we have?
● Did we solve any of them?
● How did it scale?

How did we choose an RDF Engine?
● Experimented with Java APIs:
  ● Jena + PostgreSQL
  ● Sesame + PostgreSQL
  ● Kowari + native
● Method of testing

Why did we choose Jena?
● Relational database backend
● Usability, support
● Easy to debug
● Schemagen
● Scalability

What problems did we have?
1. Insertion: performance
2. Ontologies: memory
3. Java classes: limiting flexibility
(most problems due to scale..)

The Project - Scale
● Number of triples = ~200 million and keeps growing
● Size on disk = 65 GB
● Result of loading 4.3 million articles and references
Some details of database tables:
● jena_long_lit: ~4.5 million records
● jena_long_uri: ~0.14 million records
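
A quick back-of-the-envelope calculation puts these figures in proportion. The sketch below assumes the disk size means 65 decimal gigabytes and treats "~200 million" as exactly 200 million, so the results are rough averages only; the on-disk figure presumably also covers indexes and the long-literal tables, not just the triple rows.

```python
TRIPLES = 200_000_000     # "~200 million" from the slide
DISK_BYTES = 65 * 10**9   # assumption: 65 decimal gigabytes
ARTICLES = 4_300_000      # articles loaded (references included in the triples)

# Average on-disk footprint per triple, indexes and all.
bytes_per_triple = DISK_BYTES / TRIPLES

# Average number of triples produced per loaded article.
triples_per_article = TRIPLES / ARTICLES

print(f"{bytes_per_triple:.0f} bytes per triple")
print(f"{triples_per_article:.1f} triples per article")
```

Under those assumptions the store averages about 325 bytes per triple and roughly 46.5 triples per article.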

Problem 1: Insertion performance
* Task: load backdata
* What does that actually involve? For each article:
  – Get metadata from database 1.
  – Add metadata from database 2.
  – Reform into new RDFS model
  – Query the store: look for relevant resources
  – Model.read
* Problem
* Possible solutions?
  - Turn off index rebuild
  - Turn off duplicate checking
  - Batching

Our solution - Batching
* What is batching?
* Quantitative effect?
* Costs
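
The slide leaves the details of batching to the talk. One common meaning, sketched below, is to group many inserts into a single database round trip and transaction instead of committing per statement. This is an illustration of that general technique, not the MetaStore loader: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table layout, the function name load_in_batches and the batch size are assumptions.

```python
import sqlite3


def load_in_batches(conn, triples, batch_size=1000):
    """Insert triples, committing once per batch instead of once per row."""
    cur = conn.cursor()
    commits = 0
    batch = []
    for triple in triples:
        batch.append(triple)
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
            conn.commit()
            commits += 1
            batch = []
    if batch:  # flush the final, partly filled batch
        cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
data = [(f"art:{i}", "dc:title", f"Title {i}") for i in range(2500)]
COMMITS = load_in_batches(conn, data, batch_size=1000)
ROW_COUNT = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```

Here 2,500 rows are written with three commits rather than 2,500. The usual cost of this approach is that a failure loses or has to replay a whole batch, and data becomes visible to readers later than it would with per-row commits.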

Problem 2: Ontologies - memory problems
* Advantages of ontologies for us?
* How did we start?
* What was the problem?
* Solutions?

Problem 3: Java classes - limiting flexibility?
* Not a problem with Jena/scale, but with industrial context
* Encapsulate Jena code in DAOs
* Java Interface hierarchy to mirror Schema
* What is the problem with that?
* Solutions?
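
The DAO pattern mentioned here can be sketched in a few lines. The example below is written in Python rather than Java and uses an in-memory triple list instead of Jena, so every class and method name in it (Article, ArticleDao, InMemoryArticleDao) is a hypothetical stand-in. It shows callers depending on an interface that mirrors a schema class while the RDF access stays hidden behind the DAO.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Article(ABC):
    """Hypothetical interface mirroring a struct:Article class in the schema."""

    @abstractmethod
    def identifier(self) -> str: ...

    @abstractmethod
    def title(self) -> str: ...


class ArticleDao(ABC):
    """Callers depend on this interface, not on the RDF library behind it."""

    @abstractmethod
    def find(self, identifier: str) -> Optional[Article]: ...


class _TripleArticle(Article):
    """Article view backed by raw triples; one method per mirrored property."""

    def __init__(self, identifier, triples):
        self._id = identifier
        self._triples = triples

    def identifier(self):
        return self._id

    def title(self):
        for s, p, o in self._triples:
            if s == self._id and p == "dc:title":
                return o
        raise KeyError("dc:title")


class InMemoryArticleDao(ArticleDao):
    def __init__(self, triples):
        self._triples = list(triples)

    def find(self, identifier):
        typed = (identifier, "rdf:type", "struct:Article") in self._triples
        return _TripleArticle(identifier, self._triples) if typed else None


SAMPLE = [("art:1", "rdf:type", "struct:Article"),
          ("art:1", "dc:title", "Scaling Jena")]
dao = InMemoryArticleDao(SAMPLE)
```

One plausible reading of the flexibility concern, offered here as an interpretation rather than the speaker's stated point, is visible in the sketch: each new schema property needs a new method on the interface and its implementation, so the code hierarchy must change whenever the schema does.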

Performance Testing SPARQL
  SELECT ?title ?issue ?article
  WHERE {
    ?title dc:identifier .
    ?issue prism:isPartOf ?title .
    ?issue prism:volume ?volumeLiteral .
    ?issue prism:number ?issueLiteral .
    ?article prism:isPartOf ?issue .
    ?article prism:startingPage ?firstPageLiteral .
    FILTER ( ?volumeLiteral = "20" )
    FILTER ( ?issueLiteral = "4" )
    FILTER ( ?firstPageLiteral = "539" )
  }
Type patterns shown alongside the query:
  ?title rdf:type struct:Journal .
  ?issue rdf:type struct:Issue .
  ?article rdf:type struct:Article .
Query variants: NO TYPES QUERY, TITLE TYPE QUERY, ALL TYPES QUERY

Performance Testing SPARQL
Title type only: <1.5 secs for 150 million triples
Test conditions:
● Jena 2.3
● PostgreSQL 7
● Debian
● Intel(R) Xeon(TM) CPU 3.20GHz
● 6 SCSI drives
● 4G RAM

Where are we now with the project?
Recent work:
* Loaded 4.3 million articles through the batching process; ongoing loading in place
* Non-journal content modelled
* REST API
Current work:
* Replication
* Phase out batching and use queues instead
* SPARQL merging with external named graphs

Conclusions
With a very large triple store:
* Loading performance is a challenge
* Inferencing is a challenge
* SPARQL queries need TLC
* Jena scales to 200 million triples
* Jena is a good choice for a commercial triplestore

The End