RDF-3X: a RISC-style Engine for RDF. Thomas Neumann, Gerhard Weikum, Max-Planck-Institut für Informatik. PVLDB ’08.


RDF-3X: a RISC-style Engine for RDF
Thomas Neumann, Gerhard Weikum
Max-Planck-Institut für Informatik
PVLDB ’08
May
Presented by Somin Kim

Outline
• Introduction
• Background and State of the Art
• Storage and Indexing
• Query Processing and Optimization
• Selectivity Estimates
• Evaluation
• Conclusion

Introduction (1/3): Motivation and Problem
• RDF (Resource Description Framework)
  – A flexible representation of schema-free information for the semantic web
  – Data is expressed as triples: (subject, predicate, object) or (subject, property, value)
  – All RDF triples together can be viewed as a large graph
  – The notion of RDF triples fits well with the “pay as you go” philosophy

Introduction (2/3): Motivation and Problem
• Technical challenges for managing large-scale RDF data
  – Physical database design
  – Prediction of join attributes
  – Suitable granularity of statistics gathering
  – RDF triples form a graph rather than a collection of trees

Introduction (3/3): Contribution and Outline
• RDF-3X (RDF Triple eXpress)
  – A novel architecture for RDF indexing and querying, eliminating the need for physical database design
• Key principles of RDF-3X
  – Physical design is workload-independent
    ▪ By creating appropriate indexes over a single, giant “triple table”
  – The query processor is RISC-style
    ▪ By relying mostly on merge joins over sorted index lists
  – The query optimizer employs dynamic programming for plan enumeration

Outline
• Introduction
• Background and State of the Art
  – SPARQL
  – Related Work
• Storage and Indexing
• Query Processing and Optimization
• Selectivity Estimates
• Evaluation
• Conclusion

Background and State of the Art (1/4): SPARQL
• The official (W3C) standard query language for RDF stores
• General form of a query
    SELECT ?var1 ?var2 … WHERE { pattern1. pattern2. … }
• Each pattern consists of S, P and O, and each of these is either a variable or a literal
• Two query modifiers of SPARQL
  – distinct keyword: duplicates must be eliminated
  – reduced keyword: duplicates may but need not be eliminated
• Example
    SELECT ?title WHERE { ?m ?title; ?c. ?c ?a. ?a “Johnny Depp” }

Background and State of the Art (2/4): Related Work
• Triple table
  – All triples are stored in a single table
    SELECT ?title WHERE { ?book ?title. ?book. ?book }
Based on JS Myoung’s presentation slide

Background and State of the Art (3/4): Related Work
• Property table
  – Triples are grouped by their predicate name
  [Figure: table with columns subject, property, object]

Background and State of the Art (4/4): Related Work
• Cluster-property table
  – Triples are clustered by properties that tend to be defined together

Outline
• Introduction
• Background and State of the Art
• Storage and Indexing
  – Triple Store and Dictionary
  – Compressed Indexes
  – Aggregated Indexes
• Query Processing and Optimization
• Selectivity Estimates
• Evaluation
• Conclusion

Storage and Indexing (1/7): Triples Store and Dictionary
• RDF-3X is based on a single, giant “triples table”
• Mapping dictionary
  – All literals are replaced by ids using a mapping dictionary
  – This compresses the triple store, which then contains only id triples

Triples table (literals):
  S          P          O
  object214  hasColor   blue
  object214  belongsTo  object352
  …          …          …

Triples table (ids):
  S  P  O
  …  …  …

Mapping dictionary:
  ID  Value
  0   object214
  1   hasColor
  …   …
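The dictionary step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class name `MappingDictionary` and the helper `encode_triples` are hypothetical, ids are handed out in order of first appearance, and everything lives in memory rather than on disk.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]


class MappingDictionary:
    """Hypothetical helper: assigns dense integer ids to strings on first use."""

    def __init__(self) -> None:
        self.id_of: Dict[str, int] = {}   # string -> id
        self.value_of: List[str] = []     # id -> string

    def encode(self, value: str) -> int:
        # Reuse the existing id, or hand out the next free one.
        if value not in self.id_of:
            self.id_of[value] = len(self.value_of)
            self.value_of.append(value)
        return self.id_of[value]

    def decode(self, ident: int) -> str:
        return self.value_of[ident]


def encode_triples(triples: List[Triple], d: MappingDictionary) -> List[IdTriple]:
    """Replace every string in every triple by its id."""
    return [(d.encode(s), d.encode(p), d.encode(o)) for s, p, o in triples]


triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
dictionary = MappingDictionary()
id_triples = encode_triples(triples, dictionary)
print(id_triples)
```

With this assignment order, `object214` gets id 0 and `hasColor` gets id 1, matching the dictionary rows shown on the slide; the remaining ids are an artifact of the sketch.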

Storage and Indexing (2/7): Triples Store and Dictionary
• All triples are stored in a clustered B+-tree
  – Triples are sorted lexicographically
  – This allows SPARQL patterns to be converted into range scans, e.g. ( literal1, literal2, ?x )
[Figure: id triples (002…); mapping dictionary (ID, Value): 0 object214, 1 hasColor, …; S/P/O table annotated “Actually, we don’t need this table!”]
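To illustrate why lexicographic order turns a pattern into a range scan, the following sketch uses a sorted Python list as a stand-in for the B+-tree leaves. The function name `prefix_scan` and the sample ids are assumptions made for the example.

```python
from bisect import bisect_left

# Id triples in lexicographic (S, P, O) order, standing in for B+-tree leaves.
spo = sorted([(0, 1, 2), (0, 1, 5), (0, 3, 4), (1, 1, 2), (2, 3, 0)])


def prefix_scan(index, prefix):
    """Return all triples in the sorted index that start with the non-empty `prefix`."""
    # First triple that is >= the prefix.
    lo = bisect_left(index, prefix)
    # First triple past the range: same prefix with its last component bumped by one.
    hi = bisect_left(index, prefix[:-1] + (prefix[-1] + 1,))
    return index[lo:hi]


# Pattern (literal1, literal2, ?x) with literal1 = 0 and literal2 = 1.
matches = prefix_scan(spo, (0, 1))
print(matches)
```

Both boundaries are found by binary search, so the cost is logarithmic in the index size plus the number of matching triples, which is the behaviour a clustered tree index would give as well.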

Storage and Indexing (3/7): Compressed Indexes
• The range-scan approach relies on the variables forming a suffix of the pattern
  – ( -, -, ?var ) or ( -, ?var1, ?var2 )
• To guarantee that every possible pattern, with variables in any position, can be answered by a single index scan, all six permutations of S, P and O are maintained in six separate indexes
  – (SPO, SOP, OSP, OPS, PSO, POS)
  – We can afford this level of redundancy
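A small sketch of how a pattern could be mapped to one of the six orderings: bound positions go first so that they form the scan prefix, variables go last. The function `choose_index` and the dict-based pattern representation are hypothetical; when several orderings would work, this sketch simply picks one.

```python
from itertools import permutations

# The six collation orders maintained by the engine.
ALL_ORDERS = {"".join(p) for p in permutations("SPO")}


def choose_index(pattern):
    """pattern maps 'S', 'P', 'O' to an id (bound) or None (variable).

    Returns the index order to scan and the id prefix to look up.
    """
    bound = [pos for pos in "SPO" if pattern[pos] is not None]
    free = [pos for pos in "SPO" if pattern[pos] is None]
    order = "".join(bound + free)          # bound positions first, variables as suffix
    prefix = tuple(pattern[pos] for pos in bound)
    return order, prefix


# Pattern (?s, 1, 2): predicate and object are bound, subject is a variable.
order, prefix = choose_index({"S": None, "P": 1, "O": 2})
print(order, prefix)
```

Because every permutation exists as an index, the variables of any pattern can always be pushed to the end, which is the precondition for the single range scan.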

Storage and Indexing (4/7): Compressed Indexes
• Instead of storing full triples, only the changes between consecutive triples are stored
  – The collation order causes neighboring triples to be very similar
• A byte-level compression scheme is used
  – The algorithm computes the delta to the previous tuple
  – If the delta is small, it is encoded directly in the header byte
  – Otherwise, it computes the delta values, writes the header byte with the size information, and then writes the non-zero tail of each delta
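The idea of storing only deltas between neighboring triples can be sketched as follows. This is a deliberately simplified stand-in, not the header-byte layout described on the slide: here the header byte only records the first component that changed, and values are written as generic variable-length integers. All function names are hypothetical, and ids are assumed to be non-negative and sorted.

```python
def write_varint(out: bytearray, n: int) -> None:
    """Append a non-negative integer using 7 bits per byte (high bit = more bytes follow)."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return


def read_varint(buf: bytes, pos: int):
    """Read one varint starting at `pos`; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def compress(sorted_triples):
    """Encode lexicographically sorted id triples as deltas to the previous triple."""
    out = bytearray()
    prev = (0, 0, 0)
    for triple in sorted_triples:
        # First component that differs from the previous triple (2 if identical).
        k = next((i for i in range(3) if triple[i] != prev[i]), 2)
        out.append(k)                              # simplified header byte
        write_varint(out, triple[k] - prev[k])     # small gap thanks to sort order
        for i in range(k + 1, 3):
            write_varint(out, triple[i])           # later components stored in full
        prev = triple
    return bytes(out)


def decompress(data: bytes):
    """Inverse of compress()."""
    triples, prev, pos = [], (0, 0, 0), 0
    while pos < len(data):
        k = data[pos]
        pos += 1
        delta, pos = read_varint(data, pos)
        current = list(prev[:k]) + [prev[k] + delta]
        for _ in range(k + 1, 3):
            value, pos = read_varint(data, pos)
            current.append(value)
        prev = tuple(current)
        triples.append(prev)
    return triples


sample = [(0, 1, 2), (0, 1, 5), (0, 3, 4), (1, 1, 2)]
packed = compress(sample)
print(len(packed), "bytes for", len(sample), "triples")
```

Triples that share a long prefix with their predecessor cost only a couple of bytes here, which is the effect the slide attributes to the collation order.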

Storage and Indexing (5/7): Compressed Indexes
• Comparison of byte-wise compression vs. bit-wise compression for the Barton dataset
• Each leaf page is compressed individually
  – This makes it possible to seek to any leaf page and directly start reading triples
  – The compressed index behaves just like a normal B+-tree

Storage and Indexing (6/7): Aggregated Indexes
• For many SPARQL patterns, indexing partial triples rather than full triples would be sufficient, e.g.
    select ?a ?c where { ?a ?b ?c }
• Aggregated indexes
  – Each aggregated index stores only two out of the three columns of a triple
    ▪ (value1, value2, count)
    ▪ This is done for (SP, PS, SO, OS, PO, OP)
  – In addition, there are all three one-value indexes
    ▪ (value1, count)
    ▪ This is done for (S, P, O)
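As a rough illustration, an aggregated index can be derived from the full triples by projecting onto the chosen columns and counting. The function `aggregated_index` and the sample ids are assumptions for this sketch; a real system would build these indexes incrementally and store them compressed.

```python
from collections import Counter

# Sample id triples (S, P, O).
id_triples = [(0, 1, 2), (0, 1, 5), (0, 3, 4), (1, 1, 2)]


def aggregated_index(triples, positions):
    """Count triples per distinct value combination at `positions` (0=S, 1=P, 2=O).

    Returns sorted entries of the form (value1, [value2,] count).
    """
    counts = Counter(tuple(t[i] for i in positions) for t in triples)
    return sorted(key + (n,) for key, n in counts.items())


sp_index = aggregated_index(id_triples, (0, 1))   # entries (S, P, count)
so_index = aggregated_index(id_triples, (0, 2))   # entries (S, O, count)
p_index = aggregated_index(id_triples, (1,))      # entries (P, count)
print(sp_index)
print(p_index)
```

A query such as `select ?a ?c where { ?a ?b ?c }` never looks at the predicate, so it can be answered from the (S, O, count) entries alone, with the count preserving the number of duplicates.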

Storage and Indexing (7/7)
• Triple indexes: SPO, SOP, PSO, POS, OSP, OPS
• Aggregated indexes, each with a count: SP, SO, PS, PO, OP, OS and S, P, O
Based on KS Kim’s presentation slide

Outline
• Introduction
• Background and State of the Art
• Storage and Indexing
• Query Processing and Optimization
  – Translating SPARQL Queries
  – Optimizing Join Ordering
• Selectivity Estimates
• Evaluation
• Conclusion

Query Processing and Optimization (1/2): Translating SPARQL Queries
• Each query is parsed and expanded into a set of triple patterns
• The parser performs dictionary lookups, so the literals are mapped into ids
• When a query consists of
  – a single pattern
    ▪ Use the index structures and answer the query with a single range scan
  – multiple triple patterns
    ▪ Join the results of the individual patterns
• When a query includes the distinct option, duplicates in the result are eliminated
• Finally, a dictionary lookup operator converts the resulting ids back into strings
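Joining the results of two patterns is where the sorted index lists pay off: if both inputs arrive ordered on the shared variable, they can be combined in one pass. The following is a generic textbook merge join, given here as a sketch; the `(key, payload)` row layout and the name `merge_join` are assumptions.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) rows that are both sorted by key.

    Returns rows (key, left_payload, right_payload) for every matching pair.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side ...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # ... and pair it with every left row carrying the same key.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out


# Rows for two patterns sharing one variable; the first field is that variable's id.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left_rows, right_rows)
print(joined)
```

Neither input is re-sorted or hashed, so the join needs no extra memory beyond the output, provided the chosen indexes deliver the rows in the right order.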

Query Processing and Optimization (2/2): Optimizing Join Ordering
• Requirements
  – Bushy join trees (rather than left-deep or right-deep trees)
  – Fast plan enumeration and cost estimation
  – Extensive use of merge joins
• DP framework
  – To find the best plan, consider all possible plans for subsets of the patterns
  – Recursively compute the costs of joining subsets to find the cost of each plan
  – Once the plan for a subset has been computed, store it and reuse it
  – Larger plans are created by joining optimal solutions of smaller problems
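The DP scheme above can be sketched over subsets of triple patterns. The cost model here is a toy assumption, not the one used by RDF-3X: every join applies the same hypothetical `selectivity`, the cost of a plan is the sum of its intermediate result sizes, and the join graph is ignored (cross products are not penalized).

```python
from itertools import combinations


def best_plan(cards, selectivity):
    """Bottom-up DP over pattern subsets, allowing bushy trees.

    cards: {pattern_name: estimated cardinality} (assumed inputs).
    Returns (cost, cardinality, plan), where plan is a nested tuple of names.
    """
    names = sorted(cards)
    # best[subset] = (cost, cardinality, plan tree); single patterns cost nothing extra.
    best = {frozenset([n]): (0.0, float(cards[n]), n) for n in names}
    for size in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, size)):
            members = sorted(subset)
            anchor, others = members[0], members[1:]
            # Enumerate each split once: the left side always contains the anchor.
            for k in range(len(others)):
                for extra in combinations(others, k):
                    left = frozenset((anchor, *extra))
                    right = subset - left
                    lcost, lcard, lplan = best[left]      # reuse stored sub-plans
                    rcost, rcard, rplan = best[right]
                    card = lcard * rcard * selectivity
                    cost = lcost + rcost + card
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, card, (lplan, rplan))
    return best[frozenset(names)]


cost, card, plan = best_plan({"p1": 1000, "p2": 10, "p3": 100, "p4": 5}, 0.01)
print(plan, cost)
```

Because both sides of a split may themselves be joins, the enumeration covers bushy trees and not only left-deep ones; the table `best` is what lets larger plans reuse the optimal smaller ones.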

Outline
• Introduction
• Background and State of the Art
• Storage and Indexing
• Query Processing and Optimization
• Selectivity Estimates
  – Selectivity Histograms
• Evaluation
• Conclusion

Selectivity Estimates
• Estimated cardinalities and selectivities have a huge impact on plan generation
• Selectivity histograms
  – The cardinality of a single triple pattern
    ▪ Using aggregated indexes
  – The number of join partners
    ▪ Frequent join paths

Outline
• Introduction
• Background and State of the Art
• Storage and Indexing
• Query Processing and Optimization
• Selectivity Estimates
• Evaluation
  – General Setup
  – Query Run-times
• Conclusion

Evaluation (1/3): General Setup
• Setup
  – 2GHz dual core, 2GB RAM, 30MB/s disk, Linux
• Competitors
  – MonetDB
    ▪ Column-store-based approach
    ▪ Approach presented at VLDB ’07 by Abadi et al.
  – PostgreSQL
    ▪ Triple store with SPO, POS, PSO indexes, similar to Sesame
  – Other approaches performed much worse
    ▪ Jena2, YARS2 (DERI)

Evaluation (2/3): General Setup
• Datasets
  – Barton, library data, 51M triples (4.1GB)
  – Yago, Wikipedia-based ontology, 40M triples (3.1GB)
  – LibraryThing (partial crawl), tags that users have assigned to books, 30M triples (1.8GB)
• DB load time & DB size

Evaluation (3/3): Query Run-times
• Average run-times for cold caches (sec)
  [Table: rows RDF-3X, MonetDB, PostgreSQL; columns Barton, Yago, LibraryThing]
• Average run-times for warm caches (sec)
  [Table: rows RDF-3X, MonetDB, PostgreSQL; columns Barton, Yago, LibraryThing]

Conclusion
• RDF-3X is a fast and flexible RDF/SPARQL engine
  – Exhaustive but very space-efficient triple indexes
  – Avoids physical design tuning, generic storage
  – Fast runtime system, query optimization has a huge impact