RDF-3X: a RISC style Engine for RDF Ref: Thomas Neumann and Gerhard Weikum [PVLDB’08 ] Presented by: Pankaj Vanwari Course: Advanced Databases (CS 632)



Motivation RDF (Resource Description Framework) is schema-free structured information. Increasingly popular in the context of – Knowledge bases – Semantic Web – Life sciences and online communities. Managing large-scale RDF data poses challenges for – Storage layout, indexing and querying

Overview Introduction to RDF and SPARQL Storage of RDF data Query Translation and Processing Query Optimization Evaluation Conclusion

Introduction to RDF Standard model for data interchange on the Web. Allows structured and semi-structured data to be mixed, exposed, and shared. Conceptually a labeled graph – Linking structure forms a directed labeled graph where the edges represent the named link between two resources that are represented by the graph nodes. Graph is stored as collection of facts. Each edge represents a fact (triple in RDF notation) – Triples have the form (subject; predicate; object)

RDF as labeled graph Extends the linking structure of the Web by using URIs for relationships. Subjects and predicates are identified by URI values. An object can be another URI or a value (literal).

RDF Example: Conceptual View [Figure: labeled graph with nodes id1, id2, id7 and id11, edges hasTitle, releaseYear, directedBy, hasCasting, roleName, actor and hasName, and literals "Slumdog Millionaire", 2008, "Danny Boyle", "Latika" and "Freida Pinto"]

RDF Example: Facts in triple form – (id1, hasTitle, "Slumdog Millionaire"), – (id1, releaseYear, "2008"), – (id1, directedBy, id7), – (id7, hasName, "Danny Boyle"), – (id1, hasCasting, id2), – (id2, roleName, "Latika"), – (id2, actor, id11), – (id11, hasName, "Freida Pinto"), and so on… RDF data is a (potentially huge) set of triples – Freebase: 585 million triples – YAGO2: 120 million facts about 10 million entities

Introduction to SPARQL SPARQL is used to query RDF data. The result can be a result set or an RDF graph. A SPARQL query for "the titles of all movies featuring Freida Pinto" can be: Select ?title Where { ?p <hasTitle> ?title. ?p <hasCasting> ?s. ?s <actor> ?c. ?c <hasName> "Freida Pinto" } From the previous example triples: ?c : id11, ?s : id2, ?p : id1 and ?title : "Slumdog Millionaire"
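To make the variable bindings concrete, here is a minimal Python sketch that evaluates the query by naive pattern matching over a subset of the example facts. The predicate names in the query (hasTitle, hasCasting, actor, hasName) are taken from the example triples and are an assumption about the query's original form; `match` and `evaluate` are hypothetical helpers, and this is not how RDF-3X evaluates queries.

```python
# A subset of the example facts, as (subject, predicate, object) tuples.
TRIPLES = [
    ("id1", "hasTitle", "Slumdog Millionaire"),
    ("id1", "directedBy", "id7"),
    ("id7", "hasName", "Danny Boyle"),
    ("id1", "hasCasting", "id2"),
    ("id2", "roleName", "Latika"),
    ("id2", "actor", "id11"),
    ("id11", "hasName", "Freida Pinto"),
]

def match(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def evaluate(patterns, triples):
    """Nested-loop evaluation of a conjunction of triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings

# Assumed form of the query: predicate names come from the example triples.
QUERY = [
    ("?p", "hasTitle", "?title"),
    ("?p", "hasCasting", "?s"),
    ("?s", "actor", "?c"),
    ("?c", "hasName", "Freida Pinto"),
]

RESULT = evaluate(QUERY, TRIPLES)
```

`RESULT` holds a single binding with ?p = id1, ?s = id2, ?c = id11 and ?title = "Slumdog Millionaire", matching the bindings listed on the slide.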

SPARQL- Join graph Each SPARQL query can be represented by a join graph. A possible join tree for the previous query: [Figure: join tree over P1, P2, P3 and P4] where P1 = (?p <hasTitle> ?title), P2 = (?p <hasCasting> ?s), P3 = (?s <actor> ?c) and P4 = (?c <hasName> "Freida Pinto") are triple patterns.

SPARQL: Query features FROM clause to select a data set. PREFIX clause for namespace prefixes. WHERE clause supports – Star-shaped queries – Long join path queries – FILTER to restrict values by patterns/conditions – UNION queries – OPTIONAL queries. For the result: ORDER BY, DISTINCT, CONSTRUCT, DESCRIBE and ASK clauses

Problems in querying over RDF Storing, indexing and query processing of RDF data are non-trivial: – Absence of a global schema. – Very fine-grained data items instead of records or entities. – Execution plan optimization requires statistics, which are hard to collect for RDF because there is no schema. – Physical design is difficult as RDF triples form a graph rather than a tree as in XML.

Solution Proposed RDF-3X (RDF Triple eXpress), a RISC-style execution engine based on three principles: – Physical design is workload-independent. Exhaustive compressed indexes eliminate the need for physical-design tuning. – The query processor relies mostly on merge joins over sorted index lists. – The query optimizer focuses on join order in the execution plan.

Storage of RDF data- Raw Raw RDF facts, i.e. a set of triples, are as shown. Literals can be very large and contain a lot of redundancy.

Storage of RDF data- Dictionary Compression The first step to reduce the data size is to replace literals with IDs: dictionary compression.
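A minimal Python sketch of dictionary compression, assuming IDs are assigned in order of first appearance starting at 0 (the actual ID assignment in RDF-3X may differ). `build_dictionary` and `decode` are hypothetical helper names.

```python
def build_dictionary(triples):
    """Map every string to an integer ID and rewrite triples as ID tuples."""
    ids = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID the first time a term is seen.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    return ids, encoded

def decode(ids, encoded):
    """Translate ID triples back into their original strings."""
    terms = {number: term for term, number in ids.items()}
    return [tuple(terms[number] for number in triple) for triple in encoded]

FACTS = [
    ("id1", "hasTitle", "Slumdog Millionaire"),
    ("id1", "hasCasting", "id2"),
    ("id2", "actor", "id11"),
    ("id11", "hasName", "Freida Pinto"),
]
IDS, ENCODED = build_dictionary(FACTS)
```

Each string is stored once in the dictionary, so repeated URIs and literals cost only one integer per occurrence in the triple store.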

Storage of RDF data- RDF-3X approach Store everything in a clustered B+-tree – Triples are sorted in lexicographical order, which allows SPARQL patterns to be converted into range scans. – Can be compressed well (delta encoding). – Efficient scan, fast lookup if the prefix is known. – Structure of a byte-level compressed triple: Header (Gap: 1 bit, Payload: 7 bits), value 1 (delta, 0-4 bytes), value 2 (delta, 0-4 bytes), value 3 (delta, 0-4 bytes)
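The prefix lookup can be sketched in Python as follows. A sorted list stands in for the leaf level of the clustered B+-tree, which is a simplification; the ID values and the helper `prefix_scan` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical dictionary-encoded triples, kept sorted in SPO order.
SPO = sorted([(1, 2, 3), (1, 2, 7), (1, 5, 4), (2, 2, 3), (2, 6, 1)])

def prefix_scan(index, prefix):
    """Return all triples in the sorted index that start with prefix."""
    # Binary search finds the first triple that is >= the prefix.
    position = bisect_left(index, prefix)
    matches = []
    # Matching triples are contiguous, so scan until the prefix changes.
    while position < len(index) and index[position][:len(prefix)] == prefix:
        matches.append(index[position])
        position += 1
    return matches

# The pattern (1, 2, ?o) becomes a range scan over the prefix (1, 2).
OBJECTS_OF_1_2 = prefix_scan(SPO, (1, 2))
```

Because the constants of a pattern form a prefix of the sort order, the matching triples sit next to each other and one lookup plus a sequential scan is enough.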

Storage of RDF data- RDF-3X approach The header byte denotes the number of bytes used by each of the three values (5*5*5 = 125 size combinations). The gap bit is used when only value 3 changes and the delta is less than 128 (so it fits in the header). Which sort order to choose? – 6 possible orderings, store all of them (SPO, SOP, OSP, OPS, PSO, POS) – Makes merge joins very convenient. Each SPARQL triple pattern can be answered by a single range scan.
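A simplified Python sketch of the byte-level compression described above. It assumes distinct, sorted triples of IDs below 2^32, stores a value in full whenever an earlier component changes, and packs the three byte counts into the header as n1*25 + n2*5 + n3. These details are assumptions made for illustration and may differ from the exact RDF-3X on-disk format.

```python
def byte_len(value):
    """Bytes needed for a non-negative integer below 2**32 (0 needs none)."""
    return (value.bit_length() + 7) // 8

def encode(sorted_triples):
    out = bytearray()
    prev = (0, 0, 0)
    for s, p, o in sorted_triples:
        if (s, p) == prev[:2] and 0 < o - prev[2] < 128:
            # Gap bit set: only value 3 changed, delta fits in 7 bits.
            out.append(0x80 | (o - prev[2]))
        else:
            d1 = s - prev[0]
            d2 = p - prev[1] if d1 == 0 else p
            d3 = o - prev[2] if d1 == 0 and d2 == 0 else o
            sizes = [byte_len(d) for d in (d1, d2, d3)]
            # Header encodes one of the 5*5*5 = 125 size combinations.
            out.append(sizes[0] * 25 + sizes[1] * 5 + sizes[2])
            for delta, size in zip((d1, d2, d3), sizes):
                out += delta.to_bytes(size, "big")
        prev = (s, p, o)
    return bytes(out)

def decode(data):
    triples = []
    prev = (0, 0, 0)
    i = 0
    while i < len(data):
        header = data[i]
        i += 1
        if header & 0x80:
            cur = (prev[0], prev[1], prev[2] + (header & 0x7F))
        else:
            deltas = []
            for size in (header // 25, header // 5 % 5, header % 5):
                deltas.append(int.from_bytes(data[i:i + size], "big"))
                i += size
            d1, d2, d3 = deltas
            if d1:
                cur = (prev[0] + d1, d2, d3)
            elif d2:
                cur = (prev[0], prev[1] + d2, d3)
            else:
                cur = (prev[0], prev[1], prev[2] + d3)
        triples.append(cur)
        prev = cur
    return triples

SAMPLE = sorted([(1, 2, 3), (1, 2, 10), (1, 2, 500), (1, 3, 1), (4, 0, 0), (4, 0, 70000)])
PACKED = encode(SAMPLE)
```

Neighbouring triples in a sorted index usually share a prefix, so most deltas are small and many triples shrink to a single byte.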

Storage of RDF data- Aggregated Indices Sometimes we do not need the full triple: – Are object 4 and object 13 related (by any predicate)? Maintain aggregated indexes with two out of the three columns of a triple. Six additional indexes (SP, PS, SO, OS, PO, OP) – A count is kept for the omitted third column. Example: How many author annotations does object 14 have? – Aggregated index stores (value 1, value 2, count) – Much smaller than the full index
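A minimal Python sketch of building an aggregated index from dictionary-encoded triples. The triple values and the helper `aggregated_index` are hypothetical; column positions 0, 1, 2 stand for subject, predicate, object.

```python
from collections import Counter

# Hypothetical dictionary-encoded (subject, predicate, object) triples.
TRIPLES = [(14, 7, 1), (14, 7, 2), (14, 7, 3), (14, 9, 4), (20, 7, 1)]

def aggregated_index(triples, columns):
    """Project triples onto two columns and count the collapsed triples."""
    counts = Counter(tuple(triple[c] for c in columns) for triple in triples)
    # Entries have the form (value 1, value 2, count), sorted like the index.
    return sorted((*key, count) for key, count in counts.items())

SP = aggregated_index(TRIPLES, (0, 1))  # subject, predicate, count
OS = aggregated_index(TRIPLES, (2, 0))  # object, subject, count
```

Here `SP` contains (14, 7, 3): subject 14 has three triples with predicate 7, which answers a "how many" question without touching the full triples.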

Storage of RDF data- Aggregated Indices (2) Finally, three one-value indexes (S, P, O) – Store (value 1, count) entries. Rarely needed, but very small. The additional six two-value indexes and three one-value indexes are affordable because the full triple indexes are compressed. Experimentally, the total size of all indexes is less than the original data. Smaller indexes provide faster scans and improve query performance significantly.

Query Translation and Processing The SPARQL query is transformed into a calculus representation. Each conjunctive query can be parsed into a set of triple patterns, with each component either a literal (mapped to its id) or a variable. Each triple pattern becomes an index scan. With multiple triple patterns, patterns with a common variable induce joins. Because all indexes are sorted in lexicographical order, merge joins are very attractive.
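A minimal Python sketch of a merge join over two scan results that are already sorted on the shared variable. The (key, payload) layout and the sample values are assumptions made for illustration.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs that are sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on both sides.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every combination within the two runs.
            for _, left_payload in left[i:i_end]:
                for _, right_payload in right[j:j_end]:
                    out.append((left_key, left_payload, right_payload))
            i, j = i_end, j_end
    return out

# Hypothetical scan results for two patterns sharing one variable.
LEFT = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
RIGHT = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
JOINED = merge_join(LEFT, RIGHT)
```

Both inputs are read once from front to back, and no sorting or hashing is needed because the indexes already deliver the tuples in key order.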

Query Translation and Processing (2) Each triple pattern corresponds to a node in the query graph. Join ordering is performed on the query graph. The cardinality of the result is preserved (as per standard SPARQL semantics) using the multiplicity reported by the aggregated index; count = 1 for unaggregated indexes. Disjunctive queries (UNION and OPTIONAL) are treated as nested subqueries whose results are handled as base relations with special costs.

Query Optimization Properties of SPARQL queries: – Star-shaped subqueries (star joins for an entity). – Stars often occur at the start and end of long join paths. Key issue: join ordering. Two-step process: – First, if a variable is unused, it can be projected away by using an aggregated index (preserving cardinality through count information). – Second, the optimizer decides which of the applicable indexes to use. It focuses on optimizing the join order in its query execution plans.

Query Optimization – Selectivity Estimates The decision is cost-based, using a dynamic programming strategy. A bit different from standard join ordering: – one big "relation", no schema – selectivity estimates are hard. Standard single-attribute synopses are not very useful: – Only three attributes and one big relation – But (?a, ?b, "Mumbai") and (?a, ?b, " ") produce vastly different values for ?a and ?b. Estimated cardinalities have a huge impact on performance. Two strategies are proposed for selectivity estimation: selectivity histograms and frequent paths
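A small Python sketch of cost-based join ordering by dynamic programming over subsets of triple patterns. The cost model (sum of estimated intermediate result sizes), the cardinalities and the selectivities are all assumptions made for illustration; this is not the RDF-3X cost model, and `best_plan` is a hypothetical helper.

```python
from itertools import combinations

def best_plan(card, sel):
    """card: {pattern: cardinality}; sel: {frozenset({a, b}): selectivity}."""
    rels = sorted(card)
    # best[subset] = (cost, estimated result size, plan as nested tuples)
    best = {frozenset([r]): (0.0, float(card[r]), r) for r in rels}
    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            subset = frozenset(combo)
            # Try every split of the subset into two non-empty halves.
            for k in range(1, size):
                for left_combo in combinations(combo, k):
                    left = frozenset(left_combo)
                    right = subset - left
                    left_cost, left_size, left_plan = best[left]
                    right_cost, right_size, right_plan = best[right]
                    out_size = left_size * right_size
                    for a in left:
                        for b in right:
                            # Pairs without a join predicate get selectivity 1.
                            out_size *= sel.get(frozenset((a, b)), 1.0)
                    cost = left_cost + right_cost + out_size
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, out_size, (left_plan, right_plan))
    return best[frozenset(rels)]

# Assumed estimates for a chain query P1 - P2 - P3 - P4 with a selective P4.
CARD = {"P1": 1024, "P2": 1024, "P3": 1024, "P4": 1}
SEL = {
    frozenset(("P1", "P2")): 1 / 1024,
    frozenset(("P2", "P3")): 1 / 1024,
    frozenset(("P3", "P4")): 1 / 1024,
}
COST, SIZE, PLAN = best_plan(CARD, SEL)
```

With these numbers the cheapest plan starts at the selective pattern P4 and works backwards along the chain, which shows how strongly the chosen join order depends on the cardinality estimates.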

Selectivity Estimates- Selectivity Histogram Selectivity histogram (uses aggregated indexes): – Generic, but assumes predicates are independent. – Aggregate the indexes until they fit into one page. – Merge the smallest buckets (equi-depth). – For each bucket (i.e. triple range) compute statistics. – 6 indexes; pick the best for each triple pattern. – Assumes uniformity and independence, but works quite well.
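The bucket-merging step can be sketched in Python as below. It assumes one initial bucket per index prefix and repeatedly merges the adjacent pair with the smallest combined count until a bucket budget is met. The budget stands in for "fits into one page", and the merge rule is an illustrative assumption rather than the exact RDF-3X procedure.

```python
def merge_buckets(buckets, max_buckets):
    """buckets: list of (start_key, end_key, count), sorted by key range."""
    buckets = list(buckets)
    while len(buckets) > max(1, max_buckets):
        # Pick the adjacent pair whose merged bucket would be smallest.
        i = min(
            range(len(buckets) - 1),
            key=lambda k: buckets[k][2] + buckets[k + 1][2],
        )
        first, second = buckets[i], buckets[i + 1]
        buckets[i:i + 2] = [(first[0], second[1], first[2] + second[2])]
    return buckets

# Hypothetical initial buckets: one per prefix, with its triple count.
INITIAL = [(1, 1, 5), (2, 2, 1), (3, 3, 2), (4, 4, 10)]
HISTOGRAM = merge_buckets(INITIAL, 2)
```

Merging the smallest neighbours first keeps the bucket counts roughly balanced, which is the idea behind an equi-depth histogram.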

Selectivity Estimates- Selectivity Histogram (2) Example: bucket with (subject, predicate, object) statistics. [Table: bucket for the range (10,2,30) - (10,5,12000), listing the number of prefixes of length 1, 2 and 3 and the number of subject, predicate and object joins with Subject, Predicate and Object] Estimations: – (10,4,?a) => 1000 triples – (10,4,?a), (?a,?b,?c) => 2000 triples

Selectivity Estimates- Frequent Paths Histograms still have issues with (common) large correlated join patterns: – navigation: {(?a,[],?b),(?b,[],?c),(?c,[],?d)} (chain) – selection: {(?a,[],?b),(?a,[],?c),(?a,[],?d)} (star) Frequent paths (pre-processed): compute frequent join paths and give accurate predictions for these long frequent joins. Capture common correlations: – mine the most frequent paths (chains and stars) and count them – exact prediction or an upper bound for these paths. Not as easily applicable as histograms, but very accurate
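A simplified Python sketch of counting chain paths of length two, keyed by their predicate sequence. It only illustrates the idea of pre-computing join path counts; the mining procedure in RDF-3X (longer paths, stars, frequency thresholds) is more involved, and `chain_counts` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def chain_counts(triples):
    """Count chains (?a p1 ?b), (?b p2 ?c) for every predicate pair."""
    predicates_by_subject = defaultdict(list)
    for s, p, o in triples:
        predicates_by_subject[s].append(p)
    counts = Counter()
    for s, p1, o in triples:
        # Every triple whose subject equals this object extends the chain.
        for p2 in predicates_by_subject.get(o, ()):
            counts[(p1, p2)] += 1
    return counts

FACTS = [
    ("id1", "hasCasting", "id2"),
    ("id1", "directedBy", "id7"),
    ("id2", "actor", "id11"),
    ("id2", "roleName", "Latika"),
    ("id7", "hasName", "Danny Boyle"),
    ("id11", "hasName", "Freida Pinto"),
]
PATHS = chain_counts(FACTS)
```

A stored count such as `PATHS[("hasCasting", "actor")]` gives the exact size of that two-step join, where the independence assumption of the histograms would only give a rough estimate.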

Evaluation RDF-3X is compared with: – MonetDB (column store approach) – PostgreSQL (triple store approach) Three different data sets: – Yago, Wikipedia-based ontology: 1.8 GB – LibraryThing: 3.1 GB – Barton library data: 4.1 GB Same setup for all: – Same preprocessing – Same dictionary – Equivalent queries

Evaluation - Yago sample query(B2) : select ?n1 ?n2 where { ?p1 ?n1. ?p1 ?city. ?p1 ?p2. ?p2 ?n2. ?p2 ?city }

Evaluation – Library Thing sample query(B3): select distinct ?u where { ?u [] ?b1. ?u [] ?b2.?u [] ?b3.?b1 [].?b2 []. ?b3 [] }

Evaluation – Barton Data Set [VLDB07] sample query (Q5) select ?a ?c where { ?a. ?a ?b. ?b ?c. filter (?c != ) }

Conclusion RDF-3X avoids physical design tuning through generic storage of all orderings and aggregated indexes. The triple indexes are exhaustive, but due to compression the overall size is comparable to that of the original data. Estimation of cardinalities has a huge impact on query optimization. The full paper covers managing updates; the RDF and SPARQL standards do not include updates so far. Optimization using SIP (Sideways Information Passing) and improved selectivity estimates are presented in Neumann and Weikum [SIGMOD'09], "Scalable Join Processing on Very Large RDF Graphs".