Store RDF Triples In A Scalable Way 2012-03-23 Liu Long & Liu Chunqiu.

Presentation transcript:

Store RDF Triples In A Scalable Way Liu Long & Liu Chunqiu

Scalable RDF Store Based on HBase and MapReduce Liu Chunqiu

OUTLINE
- RDF Schema Example
- Conventional RDF Storage Structures
- Briefly Introduce HBase and MapReduce
- Scalable RDF Store Based on HBase and MapReduce
- Conclusion

RDF Schema Example
In the example above, the resource "horse" is a subclass of the class "animal":
ex:horse rdfs:subClassOf ex:animal

Physical Designs for RDF Storage (1/4): Giant Triples Table
SELECT ?title WHERE { ?book ?title. ?book. ?book }
- Entire table scan!
- Redundancy!

Physical Designs for RDF Storage (2/4): Clustered Property Table
Contains clusters of properties that tend to be defined together.

Physical Designs for RDF Storage (3/4): Property-Class Table
- Exploits the type property of subjects to cluster similar sets of subjects together in the same table.
- Unlike the clustered property table, a property may exist in multiple property-class tables.
[Figure: values of the type property]

Physical Designs for RDF Storage (4/4): Vertically Partitioned Table
The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data.
We don’t have to:
- maintain null values
- rely on a clustering algorithm
[Figure: triples table with columns subject, property, object]
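The rewriting step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `vertically_partition` and the sample triples (apart from the `ex:horse` triple from the schema example) are hypothetical and do not come from the presentation.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Split (subject, property, object) triples into one two-column table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        # Each property gets its own table holding only (subject, object) rows,
        # so a subject that lacks a property simply has no row: no nulls needed.
        tables[p].append((s, o))
    # Sorting by subject keeps all rows for one subject adjacent within a table.
    return {p: sorted(rows) for p, rows in tables.items()}


# Hypothetical sample data for illustration only.
triples = [
    ("ex:book1", "ex:title", "RDF Primer"),
    ("ex:book1", "ex:author", "ex:alice"),
    ("ex:book2", "ex:title", "HBase Notes"),
    ("ex:horse", "rdfs:subClassOf", "ex:animal"),
]
tables = vertically_partition(triples)
```

Here `tables` has three entries, one per unique property, which matches the "n two-column tables" statement for n = 3.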

HBase
- A data row = a sortable row key + an arbitrary number of columns.
- Columns are grouped into column families.
- A data cell can hold multiple versions, which are distinguished by timestamp.
- HBase is like a multi-dimensional sorted map: (Row Key, Family:Column, Timestamp) -> Value
Row Key: Key1 {
  Column Family A {
    Column X:
      T1 Value1
      T2 Value2
    Column Y:
      T3 Value3
  }
  Column Family B {
    Column Z:
      T4 Value4
  }
}
1. Data in the same column family are stored together on the file system, although the table itself might be distributed across servers.
2. HBase provides a B+ tree-like index on the row key by default.
3. Secondary indexes on columns can also be created with extra effort.
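The "multi-dimensional sorted map" view can be mimicked with a small in-memory Python class. This is a toy model, not the HBase client API: the class `TinyTable`, its method names, and the use of integers 1 and 2 for the timestamps T1 and T2 are all assumptions made for illustration.

```python
class TinyTable:
    """Toy model of (row key, family:column, timestamp) -> value."""

    def __init__(self):
        # (row, family, column) -> {timestamp: value}
        self.cells = {}

    def put(self, row, family, column, timestamp, value):
        self.cells.setdefault((row, family, column), {})[timestamp] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.cells.get((row, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            # With no timestamp given, return the newest version of the cell.
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start_row, stop_row):
        """Return cell keys with start_row <= row < stop_row, in sorted order."""
        return [key for key in sorted(self.cells) if start_row <= key[0] < stop_row]


table = TinyTable()
table.put("Key1", "A", "X", 1, "Value1")
table.put("Key1", "A", "X", 2, "Value2")
table.put("Key1", "A", "Y", 3, "Value3")
table.put("Key1", "B", "Z", 4, "Value4")
latest = table.get("Key1", "A", "X")                 # newest version of the cell
older = table.get("Key1", "A", "X", timestamp=1)     # explicit older version
keys = table.scan("Key1", "Key2")                    # range scan over sorted row keys
```

The range scan is the operation the RDF schema later in the deck relies on, since triples are encoded in row keys and column names rather than in cell values.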

MapReduce

Scalable RDF Store Based on HBase and MapReduce
- Present an RDF storage schema on HBase. The storage schema considers characteristics of both the RDF data model and the HBase structure.
- Present a MapReduce algorithm for SPARQL Basic Graph Pattern processing.
- Evaluation

The Storage Schema on HBase
- Build six tables to store RDF triples: S_PO, P_SO, O_SP, PS_O, SO_P and PO_S.
- Data are stored in row keys and column names.
- The six tables have different row key and column name combinations.
S_PO data row (the column value is null!):
Row Key: Subject URI I {
  Column Family: VALUE {
    Column: (predicate1, object1)
    Column: (predicate2, object2)
    Column: (predicate3, object3)
  }
}
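A minimal in-memory sketch of the six-table layout follows, using plain Python dictionaries in place of HBase tables. The names `LAYOUTS` and `build_tables` and the sample triples are hypothetical; the split of each triple into row key and column name is inferred from the table names, so treat it as an assumption rather than the authors' exact encoding.

```python
from collections import defaultdict

# Table name -> function that splits a triple into (row key, column name).
# Assumption: the part before "_" is the row key, the part after is the column name.
LAYOUTS = {
    "S_PO": lambda s, p, o: (s, (p, o)),
    "P_SO": lambda s, p, o: (p, (s, o)),
    "O_SP": lambda s, p, o: (o, (s, p)),
    "PS_O": lambda s, p, o: ((p, s), o),
    "SO_P": lambda s, p, o: ((s, o), p),
    "PO_S": lambda s, p, o: ((p, o), s),
}


def build_tables(triples):
    """Store every triple six times, once per layout."""
    # A set of column names per row models "column value is null":
    # all information lives in the row key and the column name.
    tables = {name: defaultdict(set) for name in LAYOUTS}
    for s, p, o in triples:
        for name, split in LAYOUTS.items():
            row_key, column = split(s, p, o)
            tables[name][row_key].add(column)
    return tables


# Hypothetical sample data for illustration only.
triples = [
    ("ex:Student0", "rdf:type", "ub:GraduateStudent"),
    ("ex:Student0", "ub:memberOf", "ex:Department0"),
]
tables = build_tables(triples)
```

Because a row holds a set of columns, a subject with several objects for the same predicate just gets several columns, which is how the multi-valued property support listed among the advantages falls out of the layout.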

The Storage Schema on HBase
Triple patterns:
- P1: (s, p, o)
- P2: (?s, p, o)
- P3: (s, ?p, o)
- P4: (s, p, ?o)
- P5: (?s, ?p, o)
- P6: (s, ?p, ?o)
- P7: (?s, p, ?o)
- P8: (?s, ?p, ?o)
Tables: S_PO, P_SO, O_SP, PS_O, SO_P, PO_S
P1 and P8 can be handled by any table.
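One plausible way to pick a table for a triple pattern is to use the table whose row key consists of exactly the bound terms. The slide's figure did not survive extraction, so the mapping below is an assumption derived from the table names (with PS_O as named on the schema slide); `choose_table` is a hypothetical helper and `None` stands for a variable.

```python
def choose_table(s, p, o):
    """Return a table whose row key is formed by the bound terms of the pattern."""
    bound = (s is not None, p is not None, o is not None)
    by_bound_terms = {
        (True, False, False): "S_PO",   # P6: (s, ?p, ?o)
        (False, True, False): "P_SO",   # P7: (?s, p, ?o)
        (False, False, True): "O_SP",   # P5: (?s, ?p, o)
        (True, True, False): "PS_O",    # P4: (s, p, ?o)
        (True, False, True): "SO_P",    # P3: (s, ?p, o)
        (False, True, True): "PO_S",    # P2: (?s, p, o)
    }
    # P1 (all bound) and P8 (all unbound) can use any table;
    # S_PO is an arbitrary default chosen for this sketch.
    return by_bound_terms.get(bound, "S_PO")


table_for_type_pattern = choose_table(None, "rdf:type", "ub:Student")
```

With this choice every pattern P2 to P7 becomes a lookup on a single row key, so no table scan is needed for them.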

The Storage Schema on HBase
Advantages:
- Support for multi-valued properties
- Support for parallel processing
- Reduced I/O cost
- Suitable for HBase
Disadvantages:
- Requires more storage space
- The update operation is more complicated

Query Processing
LUBM Q2:
SELECT ?x ?y ?z WHERE {
  ?x rdf:type ub:GraduateStudent.
  ?y rdf:type ub:University.
  ?z rdf:type ub:Department.
  ?x ub:memberOf ?z.
  ?z ub:subOrganizationOf ?y.
  ?x ub:undergraduateDegreeFrom ?y.
}
[Figure: join plan over the pattern results x, y, z, (x, z), (y, z) and (x, y), with joins on x, on y, on z and on (x, y) producing (x, y, z); one join is marked as a MapReduce job]

Process of MapReduce
- Assign unique ids to all triple patterns.
- Retrieve results for all triple patterns and persist the results to the file system.
- Calculate expressions for triple patterns and the final result.
- Create and submit the MapReduce job.
- Mappers receive input (k, v) pairs.
- Reducers receive (k, v) pairs.
Example: a join on z over three triple patterns, where ?x ub:memberOf ?z yields (x, z):
  1: ?z rdf:type ub:Department.
  2: ?x ub:memberOf ?z.
  3: ?z ub:subOrganizationOf ?y.
Input of mapper: (1, Department0) (2, (Student0, Department0)) (3, (Department0, University0))
Output of mapper: (Department0, (1, )) (Department0, (2, Student0)) (Department0, (3, University0))
Output of reducer: (Student0, University0, Department0)
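The mapper and reducer in this example amount to a reduce-side join on z, which can be simulated locally in Python without Hadoop. This is a sketch only: `mapper`, `reducer`, `run_job` and `Z_POSITION` are hypothetical names, the pattern-to-id assignment follows the order in which the patterns appear on the slide, and the single-value input `(1, Department0)` is written as a 1-tuple for uniformity.

```python
from collections import defaultdict
from itertools import product

# Pattern results persisted before the job, tagged with the pattern id.
# 1: ?z rdf:type ub:Department       -> (z,)
# 2: ?x ub:memberOf ?z               -> (x, z)
# 3: ?z ub:subOrganizationOf ?y      -> (z, y)
mapper_input = [
    (1, ("Department0",)),
    (2, ("Student0", "Department0")),
    (3, ("Department0", "University0")),
]

# Position of the join variable z inside each pattern's binding tuple.
Z_POSITION = {1: 0, 2: 1, 3: 0}


def mapper(pattern_id, binding):
    """Re-key a binding by its z value and keep the remaining values."""
    pos = Z_POSITION[pattern_id]
    rest = binding[:pos] + binding[pos + 1:]
    yield binding[pos], (pattern_id, rest)


def reducer(z, values):
    """Join all bindings that share the same z value."""
    by_pattern = defaultdict(list)
    for pattern_id, rest in values:
        by_pattern[pattern_id].append(rest)
    if not by_pattern[1]:
        return  # z is not a ub:Department, so it produces no result
    for (x,), (y,) in product(by_pattern[2], by_pattern[3]):
        yield (x, y, z)


def run_job(pairs):
    """Local stand-in for the shuffle phase: group mapper output by key."""
    groups = defaultdict(list)
    for pattern_id, binding in pairs:
        for key, value in mapper(pattern_id, binding):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


results = run_job(mapper_input)
```

For the slide's input, `results` contains the single tuple `("Student0", "University0", "Department0")`, which matches the reducer output shown above.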

Process of MapReduce
Notice: if the query does not need a join, there will not be any MapReduce job.
LUBM Query6: SELECT ?X WHERE { ?X rdf:type ub:Student }
No need to join! The result of the only triple pattern is used directly as the final output.

Evaluation

Evaluation
Table columns: Universities, Q1 (sec), Q2 (sec), Q6 (sec), Q7 (sec), Q9 (sec)
- Q6 is the most efficient query.
- Q1 is the fastest among the remaining queries.
- Performance against LUBM(20, 0) is not very efficient.
- Comparing the results for LUBM(20, 0) and LUBM(100, 0), MapReduce shows great scalability as the data size grows.

Conclude and Think
- This approach works on large RDF datasets.
- It wastes a lot of storage space.
- The MapReduce algorithm needs to be more efficient.
- Enrich the storage schema.
- Use data compression.
- Reduce execution steps during query processing.
- Integrate with other MapReduce algorithms.