BBC Linked Data Platform: Profile of Triple Store usage & implications for benchmarking

What we use
- OWLIM Enterprise
  - Current version 3.5 (SPARQL 1.0)
  - Imminent upgrade to 5.3 (SPARQL 1.1)
- Dual Data Centre comprising 6 replicated triple stores

LDP in 1 slide
- Using Linked Data to join up BBC News, TV, Radio, Learning…
- Across common concepts: London, Tony Blair, Tigers
- On content creation/update: metadata is published to the Triple Store, including tags
  - Tag = content URI -> predicate -> concept URI
- SPARQL queries power the user experience (example sketch below), e.g.:
  - 10 most recent content items about Wales
  - Most recent News Article for each team in the Premier League
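To make the query pattern concrete, here is a minimal SPARQL sketch for "10 most recent content items about Wales". It assumes the cwork vocabulary shown later in this deck; the prefix URI and the concept URI for Wales are illustrative assumptions, not the production values.

PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>   # assumed namespace

SELECT ?creativeWork ?dateModified
WHERE {
  # creative works tagged as being "about" the Wales concept (hypothetical URI)
  ?creativeWork a cwork:CreativeWork ;
                cwork:about <http://www.bbc.co.uk/things/wales#id> ;
                cwork:dateModified ?dateModified .
}
ORDER BY DESC(?dateModified)
LIMIT 10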

Data inputs & outputs

High-level architecture

Update: Resource
- Resource = Geo Location, Politician, 2016 Olympics, i.e. concepts or things that can be used in tags
- 90% Creation, 10% Update
- Variable data structure
- Small data volume: < 100 statements
- SPARQL 1.1 Update
- Frequent (10,000/hour)
  - Bursts in response to periodic update
  - Bursts in response to bulk loading
- Low level of manual updates
- Medium latency requirement

Update: Resource

DROP GRAPH <resource-graph-uri> ;
INSERT DATA {
  GRAPH <resource-graph-uri> {
    ... any RDF data ...
  }
}

Note: idempotency. The whole named graph is dropped and re-inserted, so the same request can safely be replayed.

Update: Creative Works
- Creative Work = News Article, TV Programme, Recipe etc.
- 99% Creation, 1% Update
- Uniform data structure
- Currently: Sesame; imminently: SPARQL 1.1 Update
- Frequent (100/hour)
- Occurs in response to an action by a content creator
  - E.g. a journalist publishes a new news article
- Caveat: bootstrapping of bulk content
  - E.g. the Archive
- Low latency requirement

Update: Creative Works

DROP GRAPH <creative-work-graph-uri> ;
INSERT DATA {
  GRAPH <creative-work-graph-uri> {
    <creative-work-uri> a cwork:CreativeWork ;
      cwork:title "All about Linked Data" ;
      cwork:dateModified "YYYY-MM-DDT14:56:01+00:00"^^xsd:dateTime ;
      cwork:about <concept-uri> ;
      cwork:mentions <concept-uri> ;
      cms:locator <locator-uri> ;
      bbc:primaryContentOf <web-document-uri-1> ;
      bbc:primaryContentOf <web-document-uri-2> .
    <web-document-uri-1> bbc:webDocumentType <document-type-uri> .
    <web-document-uri-2> bbc:webDocumentType <document-type-uri> .
    <locator-uri> a cms:Locator ;
      cms:locatorType cms:CPS .
  }
}

Update: Dataset
- Dataset = a grouping of resources that are managed as a single serialised, versioned file
- 10% Creation, 90% Update
- Variable data structure
- SPARQL 1.1 Update
- Infrequent (10/hour)
- Low level of manual updates
- Higher data volume: current limit is 1MB
- Medium latency requirement
- Legacy solution?

Update: Dataset

DROP GRAPH <dataset-graph-uri> ;
INSERT DATA {
  GRAPH <dataset-graph-uri> {
    ... any RDF data, up to 1MB ...
  }
}

Note: idempotency, as with resource updates.

Update: Ontology
- 10% Creation, 90% Update
- Restricted to ontological statements
- SPARQL 1.1 Update
- Infrequent (1/hour)
- Low level of manual updates
- Low data volume
- Medium latency requirement
- Conflict: high-impact change vs. versioning
  - Solution: difference analysis?
  - Solution: maintain separately with semi-automatic change

Update: Ontology

DELETE DATA {
  GRAPH <ontology-graph-uri> {
    ... statements to delete ...
  }
} ;
INSERT DATA {
  GRAPH <ontology-graph-uri> {
    ... statements to insert ...
  }
}
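As a concrete, purely hypothetical instance of this pattern (the prefixes, graph URI and axioms below are illustrative, not taken from the BBC ontologies), an ontology update that moves a class under a different superclass might look like:

PREFIX ex:   <http://example.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

DELETE DATA {
  GRAPH <http://example.org/graphs/ontology> {
    ex:Recipe rdfs:subClassOf ex:CreativeWork .
  }
} ;
INSERT DATA {
  GRAPH <http://example.org/graphs/ontology> {
    ex:Recipe rdfs:subClassOf ex:EditorialWork .
    ex:Recipe rdfs:label "Recipe"@en .
  }
}

A change like this can invalidate previously inferred statements, which is exactly the high-impact-change vs. versioning conflict noted above.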

Domain queries
- Queries that touch on one of our domains
  - E.g. most recent news article for each Premier League team (sketch below)
  - E.g. all Key Stages in the English National Curriculum
- Variable size & complexity
- Variable caching
- Variable approaches to efficiency
- Efficiency is not always the priority
- Efficiency is hard to gauge: an accurate metric is dependent on the full graph
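For illustration, a SPARQL 1.1 sketch of the "most recent news article for each Premier League team" query. The sport terms and competition URI are hypothetical, and the cwork terms follow the creative-work model shown elsewhere in this deck; the production query will differ.

PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>   # assumed namespace
PREFIX sport: <http://example.org/ontologies/sport/>            # hypothetical sport vocabulary

SELECT ?team ?article ?dateModified
WHERE {
  # every team in the (hypothetical) Premier League competition resource
  ?team a sport:Team ;
        sport:memberOf <http://example.org/competitions/premier-league> .
  # creative works about that team
  ?article a cwork:CreativeWork ;
           cwork:about ?team ;
           cwork:dateModified ?dateModified .
  # keep only the most recently modified article per team
  FILTER NOT EXISTS {
    ?newer a cwork:CreativeWork ;
           cwork:about ?team ;
           cwork:dateModified ?newerDate .
    FILTER (?newerDate > ?dateModified)
  }
}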

Creative Work Queries
- Standard SPARQL template
- Variable use of parameterisation
  - Geo filter
  - Tag filter (about, mentions)
  - Creation-time filter
- Performance extremely dependent on the full data
  - High performance in testing
  - Low performance in production
- Many thousands of requests/sec
- Our principal query

Creative Work Query Filters

{{#about}}
  FILTER (?about = <{{about}}>) .
  ?creativeWork cwork:about ?about .
{{/about}}

{{#format}}
  FILTER (?format = cwork:{{format}}) .
  ?creativeWork cwork:primaryFormat ?format .
{{/format}}

{{#mentions}}
  FILTER (?mentions = <{{mentions}}>) .
  ?creativeWork cwork:mentions ?mentions .
{{/mentions}}

{{#audience}}
  OPTIONAL { ?creativeWork cwork:audience ?audience . }
  FILTER (?audience = <{{audience}}> || NOT EXISTS { ?creativeWork cwork:audience ?audience }) .
{{/audience}}

{{#within}}
  ?creativeWork cwork:tag ?location .
  ?location a geoname:Feature ;
            omgeo:within( {{within}} ) .
{{/within}}
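As an illustration of how one of these template sections might expand, below is a sketch of a query using the geo filter. The SELECT skeleton, the prefix URIs and the bounding-box coordinates are assumptions for the sake of the example (omgeo: is OWLIM's geo-spatial extension namespace, geoname: the GeoNames vocabulary); the production template is not shown here.

PREFIX cwork:   <http://www.bbc.co.uk/ontologies/creativework/>   # assumed namespace
PREFIX geoname: <http://www.geonames.org/ontology#>               # assumed namespace
PREFIX omgeo:   <http://www.ontotext.com/owlim/geo#>              # assumed namespace

SELECT ?creativeWork ?dateModified
WHERE {
  ?creativeWork a cwork:CreativeWork ;
                cwork:dateModified ?dateModified ;
                cwork:tag ?location .
  # expanded {{#within}} section with a hypothetical bounding box (lat/long pairs)
  ?location a geoname:Feature ;
            omgeo:within( 51.28 -0.49 51.69 0.33 ) .
}
ORDER BY DESC(?dateModified)
LIMIT 10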

Fundamental changes
- Fundamental changes need to be fast in production:
  - Ruleset changes
  - Configuration/administrative changes
  - Index creation/update
  - Re-indexing
  - Memory allocation
  - Naming
- Dumping and restoring data can support this (sketch below)
- Other approaches?
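As a rough illustration of the dump side of a dump-and-restore cycle (a sketch only, not the BBC's actual procedure), each named graph could be exported over SPARQL before a ruleset or configuration change. CONSTRUCT returns triples rather than quads, so graphs are dumped one at a time; the graph URI below is a placeholder.

CONSTRUCT { ?s ?p ?o }
WHERE {
  GRAPH <graph-uri> { ?s ?p ?o }
}

Re-loading the dump after the change (e.g. under a new ruleset) lets the new rules be applied to the full data.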

Finally
- Most important part of the BBC use case:
  - We need 99.99% availability of reads
  - We need 99% availability of writes
  - We need 99.99% availability of writes during critical periods
- Ontologies and rules can and should change over time
  - Changes to these must limit their effect on: availability, latency
- Our approaches are constantly evolving