Scalable Semantic Web Data Management Using Vertical Partitioning
Daniel J. Abadi, Adam Marcus, Samuel R. Madden, Kate Hollenbach (MIT)
Daniel Hurwitz
Technion: Israel Institute of Technology, Computer Science Dept.
Seminar in Databases –

It’s All Semantics, Anyway!
“All our work, our whole life is a matter of semantics, … Everything depends on our understanding of them.” – Felix Frankfurter
The Semantic Web
  ▫ Tim Berners-Lee: “I have a dream…”
  ▫ Sharing the wealth
  ▫ W3C
Technicalities
  ▫ Implementation
  ▫ Access

Resource Description Framework (RDF)
You’d make a great model
Representing information
  ▫ Semantics
  ▫ Graph of resource relations
  ▫ Statements about resources
    Subject, Property, Object
No storage requirements
Remind me how we’re related?

RDF Triples
Semantic breakdown
  ▫ “Rick Hull wrote Foundations of Databases.”
Representation
  ▫ Graph
  ▫ Statement
  ▫ XML format
[Figure: Foundations of Databases --hasAuthor--> Rick Hull]

Triples Storage
Relational database
  ▫ 3-column schema
Performance issues
  ▫ Waiting is the rust of the soul
  ▫ One massive triples table
  ▫ Queries require many self-joins

    SELECT C.obj
    FROM TRIPLES AS A, TRIPLES AS B, TRIPLES AS C
    WHERE A.subj = B.subj AND B.subj = C.subj
      AND A.prop = 'copyright' AND A.obj = '2001'
      AND B.prop = 'author' AND B.obj = 'Fox, Joe'
      AND C.prop = 'title'

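To make the self-join cost concrete, here is a minimal sketch that runs the slide's query against an in-memory SQLite database. The sample rows and the lowercase table name are assumptions made for illustration; only the three-column (subj, prop, obj) schema and the query shape come from the slide.

```python
import sqlite3

# Hypothetical sample data; only the (subj, prop, obj) schema is from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ID1", "title", "XYZ"),
        ("ID1", "author", "Fox, Joe"),
        ("ID1", "copyright", "2001"),
        ("ID2", "title", "ABC"),
        ("ID2", "author", "Orr, Tim"),
        ("ID2", "copyright", "1985"),
    ],
)

# One alias of the triples table per property touched by the query:
# three properties means the table is joined with itself twice.
query = """
SELECT C.obj
FROM triples AS A, triples AS B, triples AS C
WHERE A.subj = B.subj AND B.subj = C.subj
  AND A.prop = 'copyright' AND A.obj = '2001'
  AND B.prop = 'author' AND B.obj = 'Fox, Joe'
  AND C.prop = 'title'
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)
```

Every additional property in the query adds another alias of the same large table, which is the performance issue the slide points at.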
Getting Down to Business
I. Current State of the Art: RDF in RDBMSs
II. A Simple Alternative: vertically partitioned approach
III. Benchmarks: querying the candidates
IV. Evaluation: storage requirements and implementation
V. Results

Current State of the Art
Majority use RDBMSs
Multi-layered architecture
Querying: SPARQL converted to SQL
[Figure: layered architecture with the labels SPARQL query, RDF layer, SQL query, RDBM, Result Set, RDF in XML/Graph]

SPARQL query:

    SELECT ?title
    FROM table
    WHERE { ?book author "Fox, Joe" .
            ?book copyright "2001" .
            ?book title ?title }

Translated SQL query:

    SELECT C.obj
    FROM TRIPLES AS A, TRIPLES AS B, TRIPLES AS C
    WHERE A.subj = B.subj AND B.subj = C.subj
      AND A.prop = 'copyright' AND A.obj = '2001'
      AND B.prop = 'author' AND B.obj = 'Fox, Joe'
      AND C.prop = 'title'

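The SPARQL-to-SQL step can be sketched with a toy translator. This is an illustration of the idea only, not how any particular RDF layer works: the function name `patterns_to_sql`, the tuple encoding of triple patterns, and the sample rows are all assumptions. Each triple pattern becomes one alias of the triples table, and a variable shared between patterns becomes a join condition.

```python
import sqlite3

def patterns_to_sql(patterns, select_var):
    """Translate (subj, prop, obj) patterns into SQL over a triples table.

    Terms starting with '?' are variables; anything else is a constant.
    Returns the SQL text and the list of constants to bind.
    """
    where, bindings, params = [], {}, []
    for i, pattern in enumerate(patterns):
        for col, term in zip(("subj", "prop", "obj"), pattern):
            ref = f"t{i}.{col}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")      # constant: filter on it
                params.append(term)
            elif term in bindings:
                where.append(f"{bindings[term]} = {ref}")  # shared variable: join
            else:
                bindings[term] = ref            # first use of the variable
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {bindings[select_var]} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params

# The three patterns from the slide's SPARQL query.
patterns = [
    ("?book", "author", "Fox, Joe"),
    ("?book", "copyright", "2001"),
    ("?book", "title", "?title"),
]
sql, params = patterns_to_sql(patterns, "?title")

# Hypothetical sample data to run the generated SQL against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ID1", "title", "XYZ"), ("ID1", "author", "Fox, Joe"), ("ID1", "copyright", "2001"),
    ("ID2", "title", "ABC"), ("ID2", "author", "Orr, Tim"), ("ID2", "copyright", "1985"),
])
result = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(result)
```

The generated statement has the same shape as the slide's SQL: three aliases of one table tied together on the subject column.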
Property Table Technique
Goal: speed up queries over triple-stores
Idea: cluster triples containing properties defined over similar subjects
  ▫ Example: “title”, “author”, “copyright” are shared by books, journals, CDs, etc.
Reduces number of self-joins

Clustered Property Table

Property-Class Table

Property Tables: Issues
NULLs
Multi-valued attributes
Proliferation of unions and joins
[Figure: Foundations of Databases --hasAuthor--> Rick Hull; Foundations of Databases --hasAuthor--> John Green]

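The three issues can be seen in a small sketch. The table names `book_props` and `leftover_triples`, the column choice, and the second book are assumptions for illustration; the two authors of Foundations of Databases come from the slide. A wide property table leaves NULL cells where a subject lacks a property, and a second author does not fit in the single author column, so it spills into a leftover table that has to be unioned back in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical wide property table plus a leftover triples table.
conn.execute(
    "CREATE TABLE book_props "
    "(subj TEXT PRIMARY KEY, title TEXT, author TEXT, copyright TEXT)"
)
conn.execute("CREATE TABLE leftover_triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO book_props VALUES (?, ?, ?, ?)", [
    ("ID1", "Foundations of Databases", "Rick Hull", None),  # no copyright known
    ("ID2", "ABC", None, "1985"),                            # no author known
])
# The author column holds one value per subject, so the second author
# of ID1 has to live somewhere else.
conn.execute("INSERT INTO leftover_triples VALUES ('ID1', 'author', 'John Green')")

# Issue 1: NULL cells for properties a subject does not have.
null_cells = conn.execute(
    "SELECT SUM(title IS NULL) + SUM(author IS NULL) + SUM(copyright IS NULL) "
    "FROM book_props"
).fetchone()[0]

# Issues 2 and 3: listing all authors of one book already needs a UNION.
authors = sorted(row[0] for row in conn.execute("""
    SELECT author FROM book_props WHERE subj = 'ID1' AND author IS NOT NULL
    UNION
    SELECT obj FROM leftover_triples WHERE subj = 'ID1' AND prop = 'author'
"""))
print(null_cells, authors)
```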
Property Tables Summary
The Good
  ▫ Reduce subject-subject self-joins
The Bad
  ▫ Sluggish on cross-table joins
  ▫ How do we cluster property tables?

Vertically Partitioned Approach
Goal: speed up queries over triple-stores
Idea: one table per property
  ▫ Column 1: Subjects
  ▫ Column 2: Objects

Vertically Partitioned Approach

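As a rough sketch of the idea, the function below splits a list of triples into one two-column table per property. The function name `vertically_partition` and the sample triples are assumptions. Sorting each table by subject is also an assumption here, chosen because it is what makes merge joins on the subject column possible.

```python
from collections import defaultdict

def vertically_partition(triples):
    """Split (subj, prop, obj) triples into one (subj, obj) table per property."""
    tables = defaultdict(list)
    for subj, prop, obj in triples:
        tables[prop].append((subj, obj))
    # Assumption: keep every table sorted by subject so tables can be merge-joined.
    return {prop: sorted(rows) for prop, rows in tables.items()}

# Hypothetical sample data.
triples = [
    ("ID1", "title", "Foundations of Databases"),
    ("ID1", "author", "Rick Hull"),
    ("ID1", "author", "John Green"),   # multi-valued: simply a second row
    ("ID2", "title", "ABC"),
    ("ID2", "copyright", "1985"),      # ID2 has no author: no row, no NULL
]
tables = vertically_partition(triples)
for prop, rows in sorted(tables.items()):
    print(prop, rows)
```

A subject that lacks a property just has no row in that property's table, and a multi-valued property is stored as several rows with the same subject.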
Vertically Partitioned Approach: Advantages
Support for multi-valued attributes
Support for heterogeneous records
Access requested properties only
No need for clustering algorithms
Less is more: fewer and faster joins

Vertically Partitioned Approach: Disadvantages
More joins than property tables
  ▫ Multi-property queries require merge joins
Slower insertions into tables
  ▫ Multiple-table access for same-subject statements
  ▫ Solution: batch insertions
Standard DBMSs not optimal for this approach

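The merge joins mentioned above can be sketched as follows, assuming each property table is a list of (subj, obj) pairs already sorted by subject. The function name `merge_join` and the sample tables are hypothetical. Both inputs are scanned once, and runs of equal subjects on either side are paired up, which covers multi-valued properties.

```python
def merge_join(left, right):
    """Join two (subj, obj) lists that are sorted by subj.

    Returns (subj, left_obj, right_obj) for every pair of rows sharing a subject.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of rows with this subject on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == ls:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            # Emit every combination within the two runs.
            for _, left_obj in left[i:i_end]:
                for _, right_obj in right[j:j_end]:
                    out.append((ls, left_obj, right_obj))
            i, j = i_end, j_end
    return out

# Hypothetical property tables, sorted by subject.
title = [("ID1", "XYZ"), ("ID2", "ABC"), ("ID3", "MNO")]
author = [("ID1", "Fox, Joe"), ("ID1", "Green, John"), ("ID3", "Orr, Tim")]
joined = merge_join(title, author)
print(joined)
```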
DB Orientation: Column vs Row
Row-oriented DBMS
  (ID1, “XYZ”), (ID2, “ABC”), (ID3, “MNO”), (ID4, “DEF”), (ID5, “GHI”), …
Column-oriented DBMS
  ID1, ID2, ID3, ID4, ID5, …
  “XYZ”, “ABC”, “MNO”, “DEF”, “GHI”, …
[Figure: both layouts drawn across three layers labeled DBMS, Memory, File]

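The two layouts can be mimicked with plain Python lists, using the five records shown on the slide. This is only a sketch of the storage order, not of how any DBMS lays out pages; the variable names are made up.

```python
# The five (id, value) records shown on the slide.
records = [
    ("ID1", "XYZ"), ("ID2", "ABC"), ("ID3", "MNO"), ("ID4", "DEF"), ("ID5", "GHI"),
]

# Row orientation: all fields of a record are stored next to each other.
row_layout = [field for record in records for field in record]

# Column orientation: all values of one attribute are stored next to each other.
column_layout = [list(column) for column in zip(*records)]

# Reading only the second attribute:
# the row layout has to step over every ID on the way...
values_from_rows = row_layout[1::2]
# ...while the column layout hands back one contiguous list.
values_from_columns = column_layout[1]

print(row_layout)
print(column_layout)
```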
Jargon for the Noggin’
Tuple
Tuple metadata
  ▫ Timestamp
  ▫ Number of attributes
  ▫ NULL flags

Column-Oriented DBMS
+ Only relevant columns are retrieved
- Slower insertions
Advantages for Vertical Partitioning:
  ▫ Separate tuple metadata
  ▫ Fixed-length tuples
  ▫ Column-oriented data compression
  ▫ Optimized merge code

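As one illustration of column-oriented compression, the sketch below run-length encodes a subject column. Run-length encoding is an assumption picked for the example; the slide does not say which compression scheme is used. It suits a subject column that is stored sorted, because repeated subjects form long runs.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical subject column of a property table, sorted by subject.
subjects = ["ID1", "ID1", "ID1", "ID2", "ID3", "ID3"]
encoded = rle_encode(subjects)
print(encoded)
```

The encoding is lossless: decoding the runs gives back the original column.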
Materialized Path Expressions
Problem: for a path of length n properties
  ▫ n-1 subject-object joins required
E.g. Find books whose authors were born in 1860
[Figure: Books --hasAuthor--> Authors --wasBorn--> “1860”]

Materialized Path Expressions
Goal: eliminate joins across multiple tables
How: Combine property paths into a single table

Without the materialized path:

    SELECT B.subj
    FROM triples AS A, triples AS B
    WHERE A.prop = 'wasBorn' AND A.obj = '1860'
      AND A.subj = B.obj AND B.prop = 'Author'

With the materialized path:

    SELECT A.subj
    FROM predtable AS A
    WHERE A."author:wasBorn" = '1860'

[Figure: Books --hasAuthor--> Authors --wasBorn--> “1860”, with the added direct edge Books --hasAuthor:wasBorn--> “1860”]

Materialized Path Expressions: Breakdown
Precalculate path expression
  ▫ No join at query time
  ▫ Easy implementation in vertically partitioned schema: simply add table “hasAuthor:wasBorn”
  ▫ Property table technique: add column “hasAuthor:wasBorn”
Added cost: recalculating after insertions

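In a vertically partitioned schema, the precalculation could look like the sketch below, run here on in-memory SQLite. The book and author identifiers are made up; the table names hasAuthor, wasBorn and hasAuthor:wasBorn follow the slides. The path table is built once with a join, after which the query is a plain single-table lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hasAuthor (subj TEXT, obj TEXT);
CREATE TABLE wasBorn (subj TEXT, obj TEXT);

-- Hypothetical sample data.
INSERT INTO hasAuthor VALUES ('Book1', 'AuthorA');
INSERT INTO hasAuthor VALUES ('Book2', 'AuthorB');
INSERT INTO hasAuthor VALUES ('Book3', 'AuthorA');
INSERT INTO wasBorn VALUES ('AuthorA', '1860');
INSERT INTO wasBorn VALUES ('AuthorB', '1902');

-- Precalculate the two-step path once and store it as its own table.
CREATE TABLE "hasAuthor:wasBorn" AS
    SELECT h.subj AS subj, w.obj AS obj
    FROM hasAuthor AS h JOIN wasBorn AS w ON h.obj = w.subj;
""")

# Query with the join done at query time.
joined = sorted(row[0] for row in conn.execute(
    "SELECT h.subj FROM hasAuthor AS h JOIN wasBorn AS w ON h.obj = w.subj "
    "WHERE w.obj = '1860'"
))

# Same question against the materialized path: no join at query time.
materialized = sorted(row[0] for row in conn.execute(
    "SELECT subj FROM \"hasAuthor:wasBorn\" WHERE obj = '1860'"
))
print(joined, materialized)
```

The stored table goes stale as soon as a row is inserted into hasAuthor or wasBorn, which is the recalculation cost noted on the slide.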
Benchmark: Dataset
Barton Libraries
  ▫ 50 million triples (77% involve multi-valued properties)
  ▫ 221 unique properties (37% multi-valued)
  ▫ Good representation of Semantic Web data
RDF/XML converted into triples

Benchmark: Longwell
GUI for exploring RDF data
User applies filters to property panels
Longwell-style queries provide a realistic benchmark for testing

Benchmark: Longwell GUI

Benchmark: Longwell queries
7 queries were chosen
Each query represents a typical browsing session
  ▫ Together they exercise a diverse set of queries

Evaluation: Schema Implementations
Performance comparison of all 3 schemas
1. Triple Store
2. Property Table Store
3. Vertically Partitioned Store
   A. Row-oriented (Postgres)
   B. Column-oriented (C-Store)

Evaluation: Size Matters
Memory usage per implementation
1. Triple Store GBytes
2. Property Table Store - 14 GBytes
3. Vertically Partitioned Store (Postgres) GBytes
4. Vertically Partitioned Store (C-Store) GBytes

Results

Scalability
How does performance scale with the size of the data?
Increased number of triples from 1 million to 50 million.

Results: Scalability
Vertical partitioning schemes scale linearly
Triple-store scales super-linearly
  ▫ Prevalent sorting operations

Results: Materialized Path Expressions

Results: Further Widening

Summary
Semantic Web users require fast responses to queries
Current triple-stores just don’t cut it
  ▫ Can’t stand up to sluggish self-joins
Property tables are good, but have their limitations
Vertical partitioning takes the cake
  ▫ Competes with optimal performance of property table solution
  ▫ Step toward an interactive-time Semantic Web

Thank you! Questions?