Large-scale Linked Data Management Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman Big Linked Data Tutorial Semantic Days 2012.

Slides:



Advertisements
Similar presentations
Chapter 10: Designing Databases
Advertisements

Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.
CS525: Special Topics in DBs Large-Scale Data Management HBase Spring 2013 WPI, Mohamed Eltabakh 1.
Semantic Web Introduction
Store RDF Triples In A Scalable Way Liu Long & Liu Chunqiu.
RDF-3X: a RISC style Engine for RDF Ref: Thomas Neumann and Gerhard Weikum [PVLDB’08 ] Presented by: Pankaj Vanwari Course: Advanced Databases (CS 632)
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.
NoSQL, No SQL!!, No, SQL? Raj Nair, Penton. Variety is the spice of life Key-Value stores Document stores ColumnFam ily Graph Hybrid Spice can lead to.
Automatic Scaling of Selective SPARQL Joins Using the TIRAMOLA System E. Angelou, N. Papailiou, I. Konstantinou, D. Tsoumakos, N. Koziris Computing Systems.
Triple Stores
Shujaat Hussain. A single column A single row.
Cassandra Database Project Alireza Haghdoost, Jake Moroshek Computer Science and Engineering University of Minnesota-Twin Cities Nov. 17, 2011 News Presentation:
Scaling Distributed Machine Learning with the BASED ON THE PAPER AND PRESENTATION: SCALING DISTRIBUTED MACHINE LEARNING WITH THE PARAMETER SERVER – GOOGLE,
 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. 1 The Architecture of a Large-Scale Web Search and Query Engine.
Xyleme A Dynamic Warehouse for XML Data of the Web.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Object Naming & Content based Object Search 2/3/2003.
RDF: Building Block for the Semantic Web Jim Ellenberger UCCS CS5260 Spring 2011.
Semantic Web Query Processing with Relational Databases Artem Chebotko Department of Computer Science Wayne State University.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
Inexpensive Scalable Information Access Many Internet applications need to access data for millions of concurrent users Relational DBMS technology cannot.
Triple Stores.
Project By: Anuj Shetye Vinay Boddula. Introduction Motivation HBase Our work Evaluation Related work. Future work and conclusion.
Titan Graph Database Meet Bhatt(13MCEC02).
Berlin SPARQL Benchmark (BSBM) Presented by: Nikhil Rajguru Christian Bizer and Andreas Schultz.
Google MapReduce Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Presented by Conroy Whitney 4 th year CS – Web Development.
RDF (Resource Description Framework) Why?. XML XML is a metalanguage that allows users to define markup XML separates content and structure from formatting.
Rajashree Deka Tetherless World Constellation Rensselaer Polytechnic Institute.
ZhangGang, Fabio, Deng Ziyan /31 NoSQL Introduction to Cassandra Data Model Design Implementation.
Oracle Data Block Oracle Concepts Manual. Oracle Rows Oracle Concepts Manual.
1 © Prentice Hall, 2002 Physical Database Design Dr. Bijoy Bordoloi.
Larisa kocsis priya ragupathy
Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.
Zois Vasileios Α. Μ :4183 University of Patras Department of Computer Engineering & Informatics Diploma Thesis.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu and Bhavani Thuraisingham.
1 © 2012 OpenLink Software, All rights reserved. Virtuoso - Column Store, Adaptive Techniques for RDF Orri Erling Program Manager, Virtuoso Openlink Software.
KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association Institute of Applied Informatics.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, Bhavani Thuraisingham University.
DBXplorer: A System for Keyword- Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das Cathy Wang
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. LogKV: Exploiting Key-Value.
MySQL to NoSQL Data Modeling Challenges in Supporting Scalability ΧΑΡΟΚΟΠΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ - ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ ΚΑΙ ΤΗΛΕΜΑΤΙΚΗΣ ΠΜΣ "Πληροφορική και Τηλεματική“
Lecture # 3 & 4 Chapter # 2 Database System Concepts and Architecture Muhammad Emran Database Systems 1.
CS 347Lecture 9B1 CS 347: Parallel and Distributed Data Management Notes 13: BigTable, HBASE, Cassandra Hector Garcia-Molina.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
Fast Crash Recovery in RAMCloud. Motivation The role of DRAM has been increasing – Facebook used 150TB of DRAM For 200TB of disk storage However, there.
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08.
Shridhar Bhalerao CMSC 601 Finding Implicit Relations in the Semantic Web.
RDF-3X: a RISC-style Engine for RDF Presented by Thomas Neumann, Gerhard Weikum Max-Planck-Institut fur Informatik Saarbrucken, Germany Session 19: System.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
RDFPath: Path Query Processing on Large RDF Graph with MapReduce Martin Przyjaciel-Zablocki et al. University of Freiburg ESWC May 2013 SNU IDB.
Data Indexing in Peer- to-Peer DHT Networks Garces-Erice, P.A.Felber, E.W.Biersack, G.Urvoy-Keller, K.W.Ross ICDCS 2004.
Next Generation of Apache Hadoop MapReduce Owen
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
MAINTAING PCJS IN RYA USING FLUO 1. Outline Background/Problem Statement Approach Demonstration Next Steps 2.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
A Case Study in Building Layered DHT Applications
Hadoop.
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
CloudAnt: Database as a Service (DBaaS)
Presentation transcript:

Large-scale Linked Data Management Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman Big Linked Data Tutorial Semantic Days 2012

Large-Scale Linked Data Management (Andreas) Motivation Preliminaries Apache Cassandra CumulusRDF Storage Layouts Storage Model Hierarchical Layout Flat Layout Evaluation Conclusion Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

MOTIVATION Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Linked Data Storage and Retrieval at Scale Algorithmic complexity Size Index Lookups SPARQLReasoning MB GB TB CumulusRDF (Apache Cassandra) Redland Jena Mem BigData, 4Store, YARS2 OWLIM SAOR Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data Single machine Distributed Batch (Map-Reduce) Runtime Pellet, HermiT Jena TBD Sesame CloudSPARQL

Linked Data Management RDF data accessible via HTTP lookups Many datasets cover descriptions of millions of entities Publishers often use full-fledged triple stores Complex query processing capabilities not necessary for Linked Data lookups Trend towards specialized data management systems tailored for specific use cases Distributed key-value stores Simple (often nested) data model No (expensive) joins High availability and scalability We investigate applicability of key-value stores for managing and publishing Linked Data Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Linked Data Lookups Dereferencing URI t should return RDF graph describing t Exact content is only lightly specified Common practice (e.g. DBpedia) is to return all triples with the given URI as subject and some triples with the given URI as object Other options Only triples with the given URI as subject Concise Bounded Descriptions Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data User Agent Server GETGET RDFRDF artists/191cba6a-b83f-49ca- 883c-02b20c7a9dd5#artist ba6a-b83f-49ca-883c-02b20c7a9dd5.rdf

Triple Patterns A triple pattern is an RDF triple that may contain variables instead of RDF terms in any position ?s dbpprop:birthPlace dbpedia:Karlsruhe. or ?s foaf:name ?o. Linked Data Lookup on t translates into two triple patterns lookups (t ? ?) (? ? t) At least three indexes to cover all possible triple patterns (with prefix lookups) Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data PatternsIndex ? ? ?Any s ? ?SPO ? p ?POS ? ? oOSP s p ?SPO ? p oPOS s ? oOSP s p oAny

Apache Cassandra Open source data management system Distributed key-value store (DHT-based) Nested key-value data model Schema-less Decentralized Every node in the cluster has the same role No single point of failure Elastic Throughput increases linearly as machines are added with no downtime Fault-tolerant Data can be replicated Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

CumulusRDF Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

CumulusRDF Functionality Distributed deployment to enable scale (more data and also more clients) by adding more machines (via Cassandra) Geographical replication (via Cassandra) Write-optimised indices with eventual consistency (via Cassandra) Triple pattern lookups (via CumulusRDF index structures) Linked Data Lookups (via CumulusRDF index structures) Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

STORAGE LAYOUTS Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Nested Key-Value Storage Model Column-only { row-key : { column : value } } Super columns { row-key : { supercolumn : { column : value } } } Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data roro c 00 v 00 c 01 v r1r1 c 10 v 10 c 11 v Row Columns Column keyColumn value r2r2 sc 00 sc 01 c 000 v 000 c 010 v r3r3 sc 00 sc 01 c 000 v 000 c 010 v Super column key

Nested Key-Value Storage Model Secondary indexes map column values to rows { value : row-key } Cassandra limitations Entire rows always stored on a single node No range queries on row keys Columns are stored in specified order and allow for range queries Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Hierarchical Layout Uses super columns RDF terms occupy row, supercolumn and column positions Value is empty Three indexes SPO, POS, OSP cover all possible triple pattern Example: SPO index SPO: { s : { p : { o : - } } } Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data dbp:Jaws foaf:name rdf:type “Jaws”- dbp:Film- dbp:Work- Row keySuper column keyColumn keyValue

Flat Layout Uses columns only Range queries on column keys allow prefix lookups Concatenate second & third position to form column key SPO { s : { po : - } } po is the concatenation of predicate and object For (sp?) we perform a prefix lookup on p in row with key s Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data dbp:Jaws foaf:name “Jaws”- rdf:type dbp:Film- rdf:type dbp:Work- Row keyColumn keyValue

POS Index RDF data is skewed: many triples may share the same predicate (rdf:type is a prime example) p as row key will result in a very uneven distribution Cassandra cannot split rows among several nodes We take advantage of Cassandra’s secondary indexes Use po as row key { po : { s : - } } Smaller rows, better distribution No range queries on rows key: no prefix lookup! In each row we add a special column ‘p’ which has p as its value { po : { ‘p’ : p } } Secondary index on column ‘p’ allows retrieval of all po row keys for a given p Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

EVALUATION Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

System: 4 node cluster on virtualized infrastructure 2 CPUs, 4GB RAM, 80GB disk per node Dataset: DBpedia 3.6 subset 120M triples (all w/o multilingual labels) Triple pattern queries 1M sampled S, SP, SPO, SO, and O patterns from dataset Output: all matching triples Linked Data lookup queries 2M resource lookups from DBpedia logs (1.2M unique) Output: all triples with URI as subject and 10k triples with URI as object Evaluation Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data 10k all C0 Clients CumulusRDF C1C2C3 C0-C3: Cassandra nodes

Results – Storage Layout Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data IndexNode 1Node 2Node 3Node 4Std. Dev.Max. Row SPO Hier SPO Flat OSP Hier OSP Flat POS Hier POS Sec Values in GB SPO Flat: { s : { po : - } }, OSP POS Sec: { po : { ‘p’ : p } } SPO Hier: { s : { p : { o : - } } }, OSP, POS

Results – Pattern Lookups Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Results – Pattern Lookups Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Results – Linked Data Lookups Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data

Conclusion We evaluated two index schemes for RDF on nested key- value stores to support Linked Data lookups Flat indexing gives best overall results Output format impacts performance (N-Triples v RDF/XML) Apache Cassandra is a viable alternative to full-fledged triple stores for Linked Data lookups Future work Automatic generation and maintenance of dataset statistics Evaluate insert and update performance Get CumulusRDF at Marko Grobelnik, Andreas Harth (Günter Ladwig), Dumitru Roman, Big Linked Data