Scalable Semantic Web Data Management Using Vertical Partitioning Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach David Yona Seminar On.

Presentation transcript:

Scalable Semantic Web Data Management Using Vertical Partitioning Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach David Yona Seminar On Databases & The Internet December 2010

2 Reminder The Semantic Web –“A web of data that can be processed directly and indirectly by machines.” (Tim Berners-Lee) –Enables sharing and integration of data across different applications and organizations –Allows users to issue structured queries and receive correct data from different data-sources

3 Reminder RDF (Resource Description Framework) –The Semantic Web’s choice of data model –Basic idea: Making statements about resources in the form of subject-predicate-object expressions –A single statement can take the form of a triple: (subject, predicate, object) –Multiple statements can be represented using a graph

4 Reminder RDF Example: Represent the fact that Serge Abiteboul, Rick Hull, and Victor Vianu wrote a book called “Foundations of Databases” (7 triples):
person1 isNamed "Serge Abiteboul"
person2 isNamed "Rick Hull"
person3 isNamed "Victor Vianu"
book1 hasAuthor person1
book1 hasAuthor person2
book1 hasAuthor person3
book1 isTitled "Foundations of Databases"

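5 Reminder RDF Example: Represent the fact that Serge Abiteboul, Rick Hull, and Victor Vianu wrote a book called “Foundations of Databases”: [Figure: the same seven triples drawn as a graph. Book1 has hasAuthor edges to Person1, Person2 and Person3 and an isTitled edge to “Foundations of Databases”; each person node has an isNamed edge to its name (Serge Abiteboul, Rick Hull, Victor Vianu).]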

6 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

8 Main Issue How do we store and query RDF data in relational databases?

9 Introduction Main Issue: First Approach First approach: –Create a single three-column table (subject, property, object) and store all RDF triples in it. –Advantages: Easy! –Disadvantages: Serious performance issues! Almost all interesting queries involve many self-joins over this single table.

10 Introduction Main Issue: First Approach First approach - Performance issues: –Example: Find all of the authors of books whose title contains the word “Transaction”:
SELECT p5.obj
FROM rdf AS p1, rdf AS p2, rdf AS p3, rdf AS p4, rdf AS p5
WHERE p1.prop = 'title' AND p1.obj LIKE '%Transaction%'
  AND p1.subj = p2.subj AND p2.prop = 'type' AND p2.obj = 'book'
  AND p3.prop = 'type' AND p3.obj = 'auth'
  AND p4.prop = 'hasAuth' AND p4.subj = p2.subj AND p4.obj = p3.subj
  AND p5.prop = 'isnamed' AND p5.subj = p4.obj;
A five-way self-join!

11 Introduction Main Issue: Other Approaches Solution? –Stop Using RDF? RDF has a great deal of momentum in the web community: Several international conferences, hundreds of attendees and dozens of papers –Explore ways to improve RDF query performance Focus on using a relational query processor to execute RDF queries Two different physical organization methods: –Property Table –Vertically partitioned database

12 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

13 Current State of the Art RDF in RDBMSs Most RDF data storage solutions (such as Jena, Oracle, Sesame, 3store) are built on relational DBMSs. The common solution: A giant triples table, containing one row for each statement. Optimizations: –Store URIs and literal values in a separate table, and use keys or shortened versions in the triples table –Indexes on all three columns

14 Current State of the Art RDF in RDBMSs Multi-layered architecture: Removes any dependence on the particular RDBMS used. –Top layer: RDF-specific functionality –Bottom Layer: RDBMS Queries are issued using RDF-specific query languages (SPARQL), converted to SQL and sent to the RDBMS –RDBMS optimizes and executes the SQL query over the triple store

15 Current State of the Art RDF in RDBMSs
Triples table:
Subj.  Prop.      Obj.
ID1    type       BookType
ID1    title      “XYZ”
ID1    author     “Fox, Joe”
ID1    copyright  “2001”
ID2    type       CDType
ID2    title      “ABC”
ID2    artist     “Orr, Tim”
ID2    copyright  “1985”
ID2    language   “French”
ID3    type       BookType
ID3    title      “MNO”
ID3    language   “English”
ID4    type       DVDType
ID4    title      “DEF”
ID5    type       CDType
ID5    title      “GHI”
ID5    copyright  “1995”
ID6    type       BookType
ID6    copyright  “2004”
SPARQL:
SELECT ?title
FROM table
WHERE { ?book author "Fox, Joe"
        ?book copyright "2001"
        ?book title ?title }
SQL:
SELECT C.obj
FROM TRIPLES AS A, TRIPLES AS B, TRIPLES AS C
WHERE A.subj = B.subj AND B.subj = C.subj
  AND A.prop = 'copyright' AND A.obj = '2001'
  AND B.prop = 'author' AND B.obj = 'Fox, Joe'
  AND C.prop = 'title'
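The SQL translation on this slide can be tried on the slide's own sample data. The following is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the RDBMS; the helper name `titles_by_author_and_year` and the lowercase table name `triples` are illustrative assumptions, not part of the original system.

```python
import sqlite3

# Sample catalog triples copied from the slide's triples table.
TRIPLES = [
    ("ID1", "type", "BookType"), ("ID1", "title", "XYZ"),
    ("ID1", "author", "Fox, Joe"), ("ID1", "copyright", "2001"),
    ("ID2", "type", "CDType"), ("ID2", "title", "ABC"),
    ("ID2", "artist", "Orr, Tim"), ("ID2", "copyright", "1985"),
    ("ID2", "language", "French"),
    ("ID3", "type", "BookType"), ("ID3", "title", "MNO"),
    ("ID3", "language", "English"),
    ("ID4", "type", "DVDType"), ("ID4", "title", "DEF"),
    ("ID5", "type", "CDType"), ("ID5", "title", "GHI"),
    ("ID5", "copyright", "1995"),
    ("ID6", "type", "BookType"), ("ID6", "copyright", "2004"),
]


def titles_by_author_and_year(author, year):
    """Run the slide's three-way self-join over a single triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    rows = conn.execute(
        """
        SELECT C.obj
        FROM triples AS A, triples AS B, triples AS C
        WHERE A.subj = B.subj AND B.subj = C.subj
          AND A.prop = 'copyright' AND A.obj = ?
          AND B.prop = 'author' AND B.obj = ?
          AND C.prop = 'title'
        """,
        (year, author),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Each additional triple pattern in the SPARQL query adds one more alias of the same table, which is why the join count grows with query complexity.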

16 Current State of the Art Property Tables An attempt to speed up queries over triple stores. Implementation: Denormalized RDF tables are physically stored in a wider, flattened representation. –This reduces the amount of subject-subject self joins needed. Two types of property tables: –Clustered property tables –Property-class tables

17 Current State of the Art Property Tables: Clustered property tables Contains clusters of properties that tend to be defined together. Each table is composed of a “subject” attribute, together with a set of attributes from a certain property cluster. Multiple property tables with different clusters of properties may be created. –A particular property may only appear in at most one property table.

18 Current State of the Art Property Tables: Clustered property tables
Source: the triples table from slide 15.
Property Table:
Subj.  Type      Title  Copyright
ID1    BookType  “XYZ”  “2001”
ID2    CDType    “ABC”  “1985”
ID3    BookType  “MNO”  NULL
ID4    DVDType   “DEF”  NULL
ID5    CDType    “GHI”  “1995”
ID6    BookType  NULL   “2004”
Left-Over Triples:
Subj.  Prop.     Obj.
ID1    author    “Fox, Joe”
ID2    artist    “Orr, Tim”
ID2    language  “French”
ID3    language  “English”

19 Current State of the Art Property Tables: Property-class tables Exploits the “type” property of subjects to cluster similar sets of subjects together in the same table. A property may exist in multiple property-class tables.

20 Current State of the Art Property Tables: Property-class tables
Source: the triples table from slide 15.
Class: BookType
Subj.  Title  Author      Copyright
ID1    “XYZ”  “Fox, Joe”  “2001”
ID3    “MNO”  NULL        NULL
ID6    NULL   NULL        “2004”
Class: CDType
Subj.  Title  Artist      Copyright
ID2    “ABC”  “Orr, Tim”  “1985”
ID5    “GHI”  NULL        “1995”
Left-Over Triples:
Subj.  Prop.     Obj.
ID2    language  “French”
ID3    language  “English”
ID4    type      DVDType
ID4    title     “DEF”

21 Current State of the Art Property Tables: Advantages Most important advantage: Reduce subject-subject self joins. –Example: “return the title of the book(s) Joe Fox wrote in 2001”. On the original triples table this takes a three-way self-join; on a property table with “title”, “author” and “copyright” columns it is a selection over a single table.
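To make the saving concrete, here is a small sketch of the same question asked against a property table, again using `sqlite3`. The table name `proptable`, its column layout, and the helper names are assumptions chosen to match the slide's example rows.

```python
import sqlite3


def build_property_table():
    """Create a property table with one row per subject (rows from the slides)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE proptable ("
        "subj TEXT PRIMARY KEY, title TEXT, author TEXT, copyright TEXT)"
    )
    conn.executemany(
        "INSERT INTO proptable VALUES (?, ?, ?, ?)",
        [
            ("ID1", "XYZ", "Fox, Joe", "2001"),
            ("ID3", "MNO", None, None),   # undefined properties become NULL
            ("ID6", None, None, "2004"),
        ],
    )
    return conn


def titles_by_author_and_year(conn, author, year):
    """Single-table selection: no self-join is needed."""
    rows = conn.execute(
        "SELECT title FROM proptable WHERE author = ? AND copyright = ?",
        (author, year),
    ).fetchall()
    return [row[0] for row in rows]
```

The price of the flat layout is visible in the inserted rows: every property a subject lacks has to be stored as NULL.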

22 Current State of the Art Property Tables: Disadvantages In the real world, most queries require joins or unions to combine data from several tables. –“Find out if there are any items in the catalog copyrighted before 1990 in a language other than English”
Clustered property table:
SELECT T.subject, T.object
FROM TRIPLES AS T, PROPTABLE AS P
WHERE T.subject = P.subject AND P.copyright < 1990
  AND T.property = 'language' AND T.object != 'English'
Property-class tables:
(SELECT T.subject, T.object
 FROM TRIPLES AS T, BOOKS AS B
 WHERE T.subject = B.subject AND B.copyright < 1990
   AND T.property = 'language' AND T.object != 'English')
UNION
(SELECT T.subject, T.object
 FROM TRIPLES AS T, CDS AS C
 WHERE T.subject = C.subject AND C.copyright < 1990
   AND T.property = 'language' AND T.object != 'English')

23 Current State of the Art Property Tables: Disadvantages RDF data tends not to be very structured, and not every subject listed in the table will have all properties defined –Lots of NULL values:
Subj.  Title  Author      Copyright
ID1    “XYZ”  “Fox, Joe”  “2001”
ID3    “MNO”  NULL        NULL
ID6    NULL   NULL        “2004”
Multi-valued properties do not fit a single column, e.g. Book1 with hasAuthor edges to both Person1 and Person2.

24 Current State of the Art Property Tables: Summary Property tables can significantly improve performance by reducing the number of self-joins and typing attributes. Property tables introduce complexity by requiring property clustering to be carefully done to create property tables that are not too wide, while still being wide enough to answer most queries directly. Multi-valued attributes cause further complexity.

25 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

26 A Simpler Alternative Vertically Partitioned Approach Split the triples table into n two-column tables, where n is the number of unique properties in the data. In each table, the first column contains the subjects that define the property and the second column contains the object values for those subjects. Each table is sorted by subject.

27 A Simpler Alternative Vertically Partitioned Approach
Type: (ID1, BookType), (ID2, CDType), (ID3, BookType), (ID4, DVDType), (ID5, CDType), (ID6, BookType)
Title: (ID1, “XYZ”), (ID2, “ABC”), (ID3, “MNO”), (ID4, “DEF”), (ID5, “GHI”)
Copyright: (ID1, “2001”), (ID2, “1985”), (ID5, “1995”), (ID6, “2004”)
Language: (ID2, “French”), (ID3, “English”)
Author: (ID1, “Fox, Joe”)
Artist: (ID2, “Orr, Tim”)
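The split described on the previous slide can be sketched in a few lines. This example uses `sqlite3` and names each two-column table after its property; `vertically_partition` and `titles_by_author_and_year` are hypothetical helper names, and the property names come from the fixed sample data, so quoting them as SQL identifiers is sufficient here.

```python
import sqlite3

# Sample catalog triples copied from the slides.
TRIPLES = [
    ("ID1", "type", "BookType"), ("ID1", "title", "XYZ"),
    ("ID1", "author", "Fox, Joe"), ("ID1", "copyright", "2001"),
    ("ID2", "type", "CDType"), ("ID2", "title", "ABC"),
    ("ID2", "artist", "Orr, Tim"), ("ID2", "copyright", "1985"),
    ("ID2", "language", "French"),
    ("ID3", "type", "BookType"), ("ID3", "title", "MNO"),
    ("ID3", "language", "English"),
    ("ID4", "type", "DVDType"), ("ID4", "title", "DEF"),
    ("ID5", "type", "CDType"), ("ID5", "title", "GHI"),
    ("ID5", "copyright", "1995"),
    ("ID6", "type", "BookType"), ("ID6", "copyright", "2004"),
]


def vertically_partition(triples):
    """Create one (subj, obj) table per distinct property, loaded in subject order."""
    conn = sqlite3.connect(":memory:")
    props = sorted({prop for _, prop, _ in triples})
    for prop in props:
        conn.execute(f'CREATE TABLE "{prop}" (subj TEXT, obj TEXT)')
        rows = sorted((s, o) for s, p, o in triples if p == prop)
        conn.executemany(f'INSERT INTO "{prop}" VALUES (?, ?)', rows)
    return conn, props


def titles_by_author_and_year(conn, author, year):
    """Only the three property tables the query mentions are read."""
    rows = conn.execute(
        """
        SELECT t.obj
        FROM "author" AS a
        JOIN "copyright" AS c ON c.subj = a.subj
        JOIN "title" AS t ON t.subj = a.subj
        WHERE a.obj = ? AND c.obj = ?
        """,
        (author, year),
    ).fetchall()
    return [row[0] for row in rows]
```

The joins are still there, but they are subject-subject joins between small tables, and properties the query never mentions (artist, language, type) are not touched at all.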

28 A Simpler Alternative Vertically Partitioned Approach: Advantages Support for heterogeneous records. –Very important when the data isn’t well structured. No clustering algorithms are needed. –Schema design is straightforward. Only the properties accessed by a query need to be read. Fewer unions –All data for a particular property is located in the same table

29 A Simpler Alternative Vertically Partitioned Approach: Advantages Tables are sorted by subject: –Particular subjects can be located quickly. –Fast merge joins can be used to reconstruct information about multiple properties for subsets of subjects. Support for multi valued attributes –If ID1 has two authors, the Author table simply holds two rows: (ID1, “Fox, Joe”) and (ID1, “Green, John”).
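A merge join over two subject-sorted property tables can be sketched in plain Python as follows. This is a textbook sort-merge join written for illustration, not the C-Store or Postgres implementation; the function name `merge_join` and the list-of-pairs representation are assumptions.

```python
def merge_join(left, right):
    """Join two lists of (subject, object) pairs that are both sorted by subject.

    Returns (subject, left_object, right_object) tuples. Runs of equal
    subjects on either side are handled, so multi-valued properties work.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_subj, right_subj = left[i][0], right[j][0]
        if left_subj < right_subj:
            i += 1
        elif left_subj > right_subj:
            j += 1
        else:
            # Find the end of the run of equal subjects on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_subj:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_subj:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, left_obj in left[i:i_end]:
                for _, right_obj in right[j:j_end]:
                    out.append((left_subj, left_obj, right_obj))
            i, j = i_end, j_end
    return out


# Title table and a multi-valued Author table, as in the slide's example.
TITLE = [("ID1", "XYZ"), ("ID2", "ABC"), ("ID3", "MNO"), ("ID4", "DEF"), ("ID5", "GHI")]
AUTHOR = [("ID1", "Fox, Joe"), ("ID1", "Green, John")]
```

Because both inputs are already in subject order, each table is scanned once and no sorting or hashing is needed at query time.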

30 A Simpler Alternative Vertically Partitioned Approach: Disadvantages Queries that access several properties –Merge joins not expensive, but not free either. Inserts into vertically partitioned tables are slower: –Multiple tables need to be accessed.

31 A Simpler Alternative Extending a Column-Oriented DBMS Row-Oriented DBMSs: –Entire tuples are stored consecutively on disk or in memory. –Not efficient when a few attributes are accessed per query. Column-Oriented DBMSs: –Main idea: Store tables as collections of columns rather than as collections of rows. –Projections occur for free. –Inserts are slower! –Common use: Storing big, wide tables where only a few attributes are queried at once.

32 A Simpler Alternative Extending a Column-Oriented DBMS [Figure: the same two-column table stored in both layouts.]
Row Oriented DB: ID1, “XYZ” | ID2, “ABC” | ID3, “MNO” | ID4, “DEF” | ID5, “GHI”
Column Oriented DB: ID1, ID2, ID3, ID4, ID5 | “XYZ”, “ABC”, “MNO”, “DEF”, “GHI”

33 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Tuple headers are stored separately –Tuple metadata includes transaction timestamps, number of attributes in tuple, null flags, etc. –Row-oriented DBMSs store headers at the beginning of each tuple. (Header size in Postgres: 27 bytes). –Data size in two-column tables: Usually up to 8 bytes (strings are dictionary encoded). –Column-stores put header information in separate columns and can selectively ignore it. –Result: Table scans perform 4-5 times faster in column stores.
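The "4-5 times" figure is consistent with a back-of-the-envelope calculation from the two sizes quoted on the slide. The sketch below assumes scan cost is proportional to bytes read and ignores page layout, alignment and compression; `scan_ratio` is a hypothetical helper.

```python
HEADER_BYTES = 27  # per-tuple header size in Postgres, as quoted on the slide
DATA_BYTES = 8     # payload of a two-column tuple, as quoted on the slide


def scan_ratio(header_bytes, data_bytes):
    """Bytes read per tuple with headers inline, relative to data only."""
    return (header_bytes + data_bytes) / data_bytes


ratio = scan_ratio(HEADER_BYTES, DATA_BYTES)  # (27 + 8) / 8 = 4.375
```

Under these assumptions a row store reads roughly 4.4 times as many bytes per tuple as a column store that skips the header columns.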

34 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Optimization for fixed-length tuples –In row-stores, if any attribute is variable-length, then the entire tuple is variable-length. –This is the common case, and therefore row-stores are designed for this case: tuples are located through pointers in header. –In column-stores, fixed-length attributes are stored as arrays. In our case, both attributes are fixed-length

35 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Column-oriented data compression –Each attribute is stored separately. –Each attribute can be compressed separately using an algorithm best suited for that column. –Result: significant performance improvement.
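As one illustration of a column-specific scheme, a sorted subject column with repeated values compresses well under run-length encoding. The slide does not name a particular algorithm, so RLE is an assumption chosen for this sketch, and `rle_encode` / `rle_decode` are hypothetical helpers.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# A subject column sorted by subject; multi-valued properties repeat the subject.
SUBJECTS = ["ID1", "ID1", "ID2", "ID2", "ID2", "ID3"]
ENCODED = rle_encode(SUBJECTS)
```

The object column of the same table would typically need a different scheme, which is the point of compressing each attribute separately.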

36 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Carefully optimized column merge code –Merging columns is a very frequent operation in column stores. –Therefore, the merging code is carefully optimized to achieve high performance Example: Extensive pre-fetching is used when merging multiple columns

37 A Simpler Alternative Extending a Column-Oriented DBMS Advantages for storing 2-column tables Direct access to sorted files rather than indirection through B-Trees –In column-stores: The increased dependence on merge-sort joins necessitates that heap files are maintained in guaranteed sorted order. –In most row-stores: Order is guaranteed only through an index

38 A Simpler Alternative Extending a Column-Oriented DBMS Implementation Details An open-source column-oriented database system was extended in order to experiment with the ideas presented: –Name: C-Store –Characteristics: Each table is stored as a collection of columns. Each column is stored in a separate file. Each file contains a list of 64K blocks with as many values as possible packed in each block. Support for temporary tables, index-nested loop joins, unions, string-data operators wasn’t available.

39 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

40 Materialized Path Expressions In RDF data, object values can be either literals (“Fox, Joe”) or URIs. In the latter case, the value can be further described using additional triples. Path expressions identify an object by describing how to navigate to it in some graph of objects

41 Materialized Path Expressions To find all books whose authors were born in 1860, we need a path expression through the data. In all three RDF storage schemas described, querying path expressions is expensive. –Querying path expressions is a common operation on RDF data.

42 Materialized Path Expressions In a triple schema, a path of length n requires (n-1) subject-object self joins. –Find all books whose authors were born in 1860:
SELECT B.subj
FROM triples AS A, triples AS B
WHERE A.prop = 'wasBorn' AND A.obj = '1860'
  AND A.subj = B.obj AND B.prop = 'Author'
For the other schemas, (n-1) joins are required as well. –For vertically-partitioned schemas, all tables of properties involved must be joined, but these can’t be merge-joins, because the tables are sorted by subject while the join is on the object column.

43 Materialized Path Expressions A graphical representation: Book 1 --Author--> Joe Fox --wasBorn--> 1860

44 Materialized Path Expressions What we want: a direct Author:wasBorn edge from Book 1 to 1860, alongside the existing Author and wasBorn edges, so that the query needs no join:
SELECT A.subj
FROM proptable AS A
WHERE A.author:wasBorn = '1860'

45 Materialized Path Expressions Solution: –Using a vertically partitioned schema, this author:wasBorn path expression can be pre-calculated and the result stored in its own two column table as if it were a regular property. The joins don’t have to be performed in run time. –Multi-value support is achieved as well
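The pre-calculation can be sketched with `sqlite3` as follows. The rows for Book2, Book3 and JaneDoe are hypothetical sample data added for illustration, and the helper names are assumptions; only the Book1 / Joe Fox / 1860 path comes from the slides.

```python
import sqlite3


def build_store():
    """Two vertically partitioned tables plus a materialized path table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (subj TEXT, obj TEXT);
        CREATE TABLE wasBorn (subj TEXT, obj TEXT);
        INSERT INTO author VALUES
            ('Book1', 'JoeFox'), ('Book2', 'JaneDoe'), ('Book3', 'JoeFox');
        INSERT INTO wasBorn VALUES ('JoeFox', '1860'), ('JaneDoe', '1901');

        -- Pay for the subject-object join once, at load time, and store the
        -- result as if author:wasBorn were an ordinary property.
        CREATE TABLE "author:wasBorn" AS
            SELECT a.subj AS subj, w.obj AS obj
            FROM author AS a JOIN wasBorn AS w ON a.obj = w.subj
            ORDER BY a.subj;
        """
    )
    return conn


def books_by_birth_year(conn, year):
    """Query time: a plain selection on the materialized table, no join."""
    rows = conn.execute(
        'SELECT subj FROM "author:wasBorn" WHERE obj = ? ORDER BY subj',
        (year,),
    ).fetchall()
    return [row[0] for row in rows]


def books_by_birth_year_joined(conn, year):
    """Baseline: the same answer computed with the join at query time."""
    rows = conn.execute(
        "SELECT a.subj FROM author AS a JOIN wasBorn AS w ON a.obj = w.subj "
        "WHERE w.obj = ? ORDER BY a.subj",
        (year,),
    ).fetchall()
    return [row[0] for row in rows]
```

A book with several authors simply contributes several rows to the materialized table, which is how multi-value support carries over.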

46 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

47 Benchmark Used to evaluate the performance of all three RDF databases. Based on publicly available library data and a collection of queries generated from a web-based user interface for browsing RDF content.

48 Benchmark Barton Data Data taken from the Barton Library dataset. Contains records acquired from an RDF-formatted dump of the MIT Libraries Barton catalog. Converted from RDF/XML syntax to triples. Duplicate triples were eliminated. The numbers: –50,255,599 triples –221 unique properties –82 (37%) properties are multi-valued –77% of the triples have a multi-valued property A good demonstration of the relatively unstructured nature of Semantic Web data.

49 Benchmark Longwell Overview A tool which provides a GUI for generic RDF data exploration in a web browser. Shows the user a list of currently filtered resources (RDF subjects) and a list of filters in the side-panels. User can select a filter and narrow down the presented data. Assumption: A set of 28 interesting properties over which the queries will be run has been selected beforehand. –There are 26,761,389 triples for these properties.

50 Benchmark Longwell Overview Longwell Opening Screen

51 Benchmark Longwell Overview Longwell Screen Shot After Clicking on “Text” in the Type Property Panel

52 Benchmark Longwell Overview Longwell Screen Shot After Clicking on “Text” in the Type Property Panel and Scrolling down

53 Benchmark Longwell Overview Longwell Screen Shot After Clicking on “fre” in the Language Property Panel

54 Benchmark Longwell Queries Query 1: –Calculate the opening panel displaying the counts of the different types of data in the RDF store. –This requires a search for the objects and counts of those objects with property Type. There are 30 such objects. –For example: “Type: Text” has a count of 1,542,280, and “Type: NotatedMusic” has a count of 36,441.

55 Benchmark Longwell Queries Query 2: –The user selects “Type: Text” from the previous panel. –Longwell must present him with a list of other defined properties for resources of “Type: Text”. –It must also calculate the frequency of these properties. –For example, the Language property is defined 1,028,826 times for resources that are of “Type: Text”.

56 Benchmark Longwell Queries Query 3: –For each property defined on items of “Type: Text”, populate the property panel with the counts of popular object values for that property (where popular means that an object value appears more than once). –For example, the property Edition has 8 items with value “[1st ed. reprinted].”

57 Benchmark Longwell Queries Query 4: –This query recalculates all of the property-object counts from Q3 after the user clicks on the “French” value in the “Language” property panel. –Essentially this is narrowing the working set of subjects to those whose Type is Text and Language is French. –This query has a much higher selectivity than Q3.

58 Benchmark Longwell Queries Query 5: –Here we perform a type of inference. –If there are triples of the form (X Records Y) and (Y Type Z) then we can infer that X is of type Z. Here “X Records Y” means that X records information about Y (for example, X might be a web page with information on Y). –For this query, we want to find the inferred type of all subjects that have this “Records” property defined that also originated in the US Library of Congress (i.e. contain triples of the form (X origin “DLC”)). –The subject and inferred type is returned for all non-Text entities.

59 Benchmark Longwell Queries Query 6: –For this query, we combine the inference first step of Q5 with the property frequency calculation of Q2 to extract information in aggregate about items that are either directly known to be of “Type: Text” (as in Q2) or inferred to be of “Type: Text” through the Q5 Records inference.

60 Benchmark Longwell Queries Query 7: –Finally, we include a simple triple selection query with no aggregation or inference. –The user tries to learn what a particular property (in this case, “Point”) actually means by selecting other properties that are defined along with a particular value of this property. –The user wishes to retrieve subject, Encoding, and Type of all resources with a Point value of “end”. The result set indicates that all such resources are of the type Date. –This explains why these resources can have “start” and “end” values: each of these resources represents a start or end date, depending on the value of Point.

61 Benchmark Evaluation The performance of all 7 queries is compared using three different schemas: –Triples schema –Property tables schema –Vertically partitioned schema Performance of each of the three schemas is studied in a row-store (Postgres), and for the vertically-partitioned schema, also in a column-store.

62 Benchmark Store Implementation Details All implementations feature a dictionary encoding table that maps strings to integer IDs. –These IDs are used instead of strings to represent properties, subject and objects. Encoding table has a clustered B+Tree index on the IDs and an unclustered B+Tree on the strings.
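Dictionary encoding can be sketched as a string-to-integer map built while loading the triples. This in-memory version is only an illustration; the benchmark stores the mapping as an indexed table, and the names `dictionary_encode` and `decode` are hypothetical.

```python
def dictionary_encode(triples):
    """Replace every string in the triples with a compact integer ID.

    Returns the encoded triples and the string -> ID dictionary.
    IDs are assigned in order of first appearance.
    """
    ids = {}

    def encode(value):
        if value not in ids:
            ids[value] = len(ids)
        return ids[value]

    encoded = [(encode(s), encode(p), encode(o)) for s, p, o in triples]
    return encoded, ids


def decode(encoded, ids):
    """Map integer IDs back to their strings."""
    strings = {number: value for value, number in ids.items()}
    return [(strings[s], strings[p], strings[o]) for s, p, o in encoded]


SAMPLE = [
    ("ID1", "type", "BookType"),
    ("ID1", "title", "XYZ"),
    ("ID3", "type", "BookType"),
]
ENCODED, IDS = dictionary_encode(SAMPLE)
```

Fixed-width integers are what make the two-column tables fixed-length, which the column store relies on.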

63 Benchmark Store Implementation Details Triple Store: –Table with three columns – subject, property, object. –Three B+Tree indices: “subject, property, object” – clustered “property, object, subject” – unclustered “object, subject, property” – unclustered –The list of 28 interesting properties is stored in a separate table. –Total storage space needed: 8.3GB

64 Benchmark Store Implementation Details Property Table Store –A property table for each query was created – each table containing only the columns accessed by that query. A table with all 28 interesting properties A table with subject and Text Etc. –In almost all cases, multi-valued attributes were stored as integer arrays. –Single-valued attributes that are used as selection predicates received B+Tree indices –All tables had a clustered index on “subject” –Total storage space needed: Over 14 GB

65 Benchmark Store Implementation Details Vertically Partitioned Store (Postgres) –One table per property. Each table contains a subject and an object column. –A clustered B+Tree index on “subject” –An unclustered B+Tree index on “object” –Multi-valued attributes are represented through multiple rows in the table. –Total storage space needed: 5.2GB

66 Benchmark Store Implementation Details Vertically Partitioned Store (C-Store) –Properties are stored on disk in separate files, in blocks of 64K. –Table structure as before. –Each property has a clustered B+Tree on “subject” –Single valued, low cardinality properties have a bit-map index on “object” –Total storage space needed: 2.7GB

67 Benchmark Query Implementation Details Query 1: –Triple store: No join needed. Aggregation can occur directly on the object column after “property=Type” selection:
SELECT A.obj, count(*)
FROM triples AS A
WHERE A.prop = " "
GROUP BY A.obj
–Vertically partitioned table / Column store: Aggregate the object values for the Type table. –Property table: Same as vertically partitioned table.
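The two Query 1 plans can be compared on the toy catalog from the earlier slides rather than the Barton data. In this sketch the property name `'type'`, the table name `type_vp` and the helper names are placeholders chosen for illustration.

```python
import sqlite3

# Type triples from the sample catalog, plus one unrelated triple.
ROWS = [
    ("ID1", "type", "BookType"), ("ID2", "type", "CDType"),
    ("ID3", "type", "BookType"), ("ID4", "type", "DVDType"),
    ("ID5", "type", "CDType"), ("ID6", "type", "BookType"),
    ("ID1", "title", "XYZ"),
]


def build():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", ROWS)
    # Vertically partitioned version: a table holding only the type property.
    conn.execute(
        "CREATE TABLE type_vp AS SELECT subj, obj FROM triples WHERE prop = 'type'"
    )
    return conn


def counts_from_triples(conn):
    """Triple store: select on the property, then aggregate on object."""
    return conn.execute(
        "SELECT obj, COUNT(*) FROM triples WHERE prop = 'type' "
        "GROUP BY obj ORDER BY obj"
    ).fetchall()


def counts_from_vp(conn):
    """Vertically partitioned: aggregate the Type table directly."""
    return conn.execute(
        "SELECT obj, COUNT(*) FROM type_vp GROUP BY obj ORDER BY obj"
    ).fetchall()
```

Both plans return the same counts; the difference is that the partitioned plan never reads rows belonging to other properties.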

68 Benchmark Query Implementation Details Query 5: –Triple store: Selection on “property=Origin” and “object=DLC”, then self-join on subject. For subjects with a “Records” property, subject-object join is performed:
SELECT B.subj, C.obj
FROM triples AS A, triples AS B, triples AS C
WHERE A.subj = B.subj
  AND A.prop = " " AND A.obj = " "
  AND B.prop = " " AND B.obj = C.subj
  AND C.prop = " " AND C.obj != " "
–Property table: Selection is applied on “Origin=DLC”. “Records” column is projected and (self) joined with the “subject” column of the original table. The “Type” values of the join results are returned. –Vertically partitioned table / Column store: “Object=DLC” selection on “Origin” property, join subjects with “Records” table, perform subject-object join of “Type”-“Records” to get the results.
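The Query 5 inference can be exercised on a few made-up triples. Everything in the sample data below (page1, item1, the property names `origin`, `records`, `type`, and the value `OTHER`) is hypothetical and stands in for the identifiers used in the real benchmark.

```python
import sqlite3

# Hypothetical triples: X origin O, X records Y, Y type Z.
ROWS = [
    ("page1", "origin", "DLC"), ("page1", "records", "item1"),
    ("item1", "type", "Date"),
    ("page2", "origin", "DLC"), ("page2", "records", "item2"),
    ("item2", "type", "Text"),
    ("page3", "origin", "OTHER"), ("page3", "records", "item3"),
    ("item3", "type", "Image"),
]


def inferred_types(origin="DLC", excluded_type="Text"):
    """Infer (X, Z) from (X records Y) and (Y type Z) for subjects of one origin."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(
        """
        SELECT B.subj, C.obj
        FROM triples AS A, triples AS B, triples AS C
        WHERE A.subj = B.subj                      -- subject-subject join
          AND A.prop = 'origin' AND A.obj = ?
          AND B.prop = 'records' AND B.obj = C.subj  -- subject-object join
          AND C.prop = 'type' AND C.obj != ?
        ORDER BY B.subj
        """,
        (origin, excluded_type),
    ).fetchall()
    conn.close()
    return rows
```

The `B.obj = C.subj` condition is the subject-object join that the results section identifies as expensive in every schema.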

69 Benchmark Query Implementation Details Query 6: –Triple store: The query first finds subjects that are directly of “Type=Text” using selection, then finds subjects that are inferred to be “Type=Text” through subject-object join of “Records” property. Next, the other properties are found using a self-join on “subject”, and finally a count is performed:
SELECT A.prop, count(*)
FROM triples AS A, properties AS P,
  ( (SELECT B.subj
     FROM triples AS B
     WHERE B.prop = " " AND B.obj = " ")
    UNION
    (SELECT C.subj
     FROM triples AS C, triples AS D
     WHERE C.prop = " " AND C.obj = D.subj
       AND D.prop = " " AND D.obj = " ")
  ) AS uniontable
WHERE A.subj = uniontable.subj AND P.prop = A.prop
GROUP BY A.prop;
–Property table, Vertically partitioned table, Column store: Create temporary tables in a manner similar to Query 5

70 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

71 Results

72 Results Property table and vertical partitioning approaches perform 2-3 times faster than the triple store approach. C-Store added another factor of 10 performance improvement.

73 Results Performance Differences Query 1: –Property table and vertical partitioning numbers are identical: Idealized property tables were used. –Triple-store not too slow: No self-joins. –C-Store performs an order of magnitude better: Table size is 4 times smaller. Joining keys with string dictionary table: C-Store uses index nested loop joins, Postgres uses merge-join.

74 Results Performance Differences Query 2: –Triple-store: Many subject-subject joins. –Property table vs. vertical partitioning: Vertical partitioning approach must perform 28 merge-joins, property table doesn’t perform even one. Query 3: –Property table hurt by multiple sequential scans: Grouping must be per property and then object for each column, and therefore each column must group by the object values in that particular column. Query 4: –Highly selective. Query 5-7 –Subject-object joins hurt each of the stores significantly.

75 Results Performance Differences A vertically partitioned database provides a significant improvement over the triple store schema. Vertical partitioning vs. property tables in a row store: –Performance –Simplicity Vertical partitioning in a column-store

76 Results Scalability The magnitude of query performance is important. How performance scales with size of data is at least as important.

77 Results Materialized Path Expressions Expensive subject-object joins replaced with cheaper subject-subject joins. Queries 5-6 rerun using MPEs. Implementation: –Property table: New column. –Vertically partitioned table: New table.
                       Q5                    Q6
Property Table         39.49 (17.5% faster)  62.6 (38% faster)
Vertical Partitioning  4.42 (92% faster)     65.84 (22% faster)
C-Store                2.57 (84% faster)     2.70 (75% faster)

78 Results The Effect of Further Widening The Semantic-Web content is likely to have an unstructured schema. Property tables in the benchmark were pre-optimized. Result of adding extra columns: [Table: columns Query, Wide Property Table, Property Table % slowdown; rows Q1 to Q7; the numeric values are missing from this transcript.]

79 RoadMap I.Introduction II.Current State of Art III.A Simpler Alternative IV.Materialized Path Expressions V.Benchmark VI.Results VII.Conclusion

80 Conclusion The emergence of the Semantic-Web necessitates high-performance data management tools to manage the tremendous collections of RDF data being produced. Current state of the art RDF databases – triple-stores – perform and scale extremely poorly. The previously proposed “property table” optimization has not been adopted in most RDF databases, and has many disadvantages. Vertically partitioned tables achieve performance similar to property tables in a row-oriented database, and outperform other solutions in a column-oriented database.

Questions?

Thank You