RDF-3X: a RISC-style Engine for RDF
Presented by Thomas Neumann, Gerhard Weikum
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Session 19: System Centric Optimization, VLDB
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Copyright 2009 by CEBT
Overview
Goal
– Build a new type of triple store: RDF-3X
– Compare RDF-3X with traditional triple stores
In this presentation
– Focus on the physical storage design, which affected the entire implementation of the system
Center for E-Business Technology
Introduction
RDF: Resource Description Framework
Conceptually a labeled graph
In RDF, all data items are represented in the form of
– (subject, predicate, object), aka (subject, property, value)
RDF data can be seen as a (potentially huge) set of triples

S    P    O
S1   P1   O1
S1   P2   O2
…    …    …

2009 IDS Lab. Winter Seminar
Introduction
SPARQL: SPARQL Protocol and RDF Query Language
The official standard for querying RDF stores
Example
– Retrieve the titles of all movies with Johnny Depp
SPARQL queries are pattern-matching queries on the triples stored in the RDF store
Each pattern consists of S, P, O, and each of these is either a variable or a literal
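To make the pattern-matching view concrete, here is a minimal sketch in Python. It is not the slide's query and not how RDF-3X evaluates queries: the data, the predicate names (`hasTitle`, `hasActor`, `hasName`), and the naive nested-loop evaluation are all assumptions made for illustration. Terms starting with `?` are treated as variables; everything else is a constant.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match exactly.
            return None
    return new

def evaluate(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Naive evaluation: join the patterns one by one over all triples."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                nb = match_pattern(pattern, t, b)
                if nb is not None:
                    extended.append(nb)
        bindings = extended
    return bindings

# Hypothetical data and predicate names, for illustration only.
TRIPLES = [
    ("movie1", "hasTitle", "Title A"),
    ("movie1", "hasActor", "actor1"),
    ("movie2", "hasTitle", "Title B"),
    ("movie2", "hasActor", "actor2"),
    ("actor1", "hasName", "Johnny Depp"),
    ("actor2", "hasName", "Someone Else"),
]
QUERY = [
    ("?m", "hasTitle", "?title"),
    ("?m", "hasActor", "?a"),
    ("?a", "hasName", "Johnny Depp"),
]
titles = sorted(b["?title"] for b in evaluate(QUERY, TRIPLES))
```

Each pattern after the first acts as a join on the shared variables, which is why the physical storage design matters so much for query cost.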
Physical Designs for RDF Storage (1/4)
Giant Triples Table
SELECT ?title WHERE { ?book ?title. ?book. ?book }
Problems highlighted on the slide
– Join! Join! (every additional triple pattern is a self-join on the table)
– Entire table scan!
– Redundancy!
Physical Designs for RDF Storage (2/4)
Clustered Property Table
Contains clusters of properties that tend to be defined together
Physical Designs for RDF Storage (3/4)
Property-Class Table
Exploits the type property of subjects to cluster similar sets of subjects together in the same table
Unlike in a clustered property table, a property may exist in multiple property-class tables
(Figure label: values of the type property)
Physical Designs for RDF Storage (4/4)
Vertically Partitioned Table
The giant (subject, property, object) table is rewritten into n two-column tables, where n is the number of unique properties in the data
We don't have to
– Maintain null values
– Rely on a clustering algorithm
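The rewrite into n two-column tables can be sketched as follows. This is only an illustration of the idea, assuming in-memory Python lists in place of real tables; the function name `vertically_partition` and the third example triple are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def vertically_partition(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Split a triples table into one (subject, object) table per property."""
    tables: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Keep each two-column table sorted by subject.
    return {p: sorted(rows) for p, rows in tables.items()}

triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
    ("object352", "hasColor", "red"),  # hypothetical extra triple
]
tables = vertically_partition(triples)
```

A subject that lacks a property simply has no row in that property's table, so no null values are stored, and the grouping needs no clustering algorithm because the property itself is the partition key.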
RDF-3X
Technical Challenges
The diversity of predicate names poses a major problem for physical database design
– Joins, redundancy, ...
RDF-3X (RDF Triple eXpress)
A novel architecture for RDF indexing and querying that eliminates the need for physical database design
Mapping Dictionary
Replace all literals by unique IDs using a mapping dictionary
RDF-3X is based on a single “giant triples table”, but the mapping dictionary compresses the triple store
– Reduced redundancy, saving a lot of physical space

Original triples
S           P           O
object214   hasColor    blue
object214   belongsTo   object352
…           …           …

ID-encoded triples
S   P   O
…   …   …

Mapping dictionary
ID   Value
0    object214
1    hasColor
…    …
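A minimal sketch of dictionary encoding follows. It assumes IDs are handed out in first-seen order, which happens to reproduce the two dictionary entries visible on the slide (0 for object214, 1 for hasColor); the class and function names are hypothetical, and the actual ID assignment in RDF-3X may differ.

```python
from typing import Dict, List, Tuple

class MappingDictionary:
    """Bidirectional mapping between string values and integer IDs."""

    def __init__(self) -> None:
        self._id_of: Dict[str, int] = {}
        self._value_of: List[str] = []

    def encode(self, value: str) -> int:
        # Assumption: a new value gets the next free ID (first-seen order).
        if value not in self._id_of:
            self._id_of[value] = len(self._value_of)
            self._value_of.append(value)
        return self._id_of[value]

    def decode(self, ident: int) -> str:
        return self._value_of[ident]

def encode_triples(triples: List[Tuple[str, str, str]],
                   dictionary: MappingDictionary) -> List[Tuple[int, ...]]:
    """Replace every string in every triple by its integer ID."""
    return [tuple(dictionary.encode(term) for term in triple) for triple in triples]

dictionary = MappingDictionary()
encoded = encode_triples(
    [("object214", "hasColor", "blue"),
     ("object214", "belongsTo", "object352")],
    dictionary,
)
```

Each distinct string is stored once in the dictionary, and the triples themselves shrink to small integer tuples, which is where the space saving comes from.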
Clustered B+-Tree
Store everything in a clustered B+-tree
Triples are sorted in lexicographical order
– This allows SPARQL patterns to be converted into range scans
– We don't have to scan the entire table
Actually, we don't need the ID-encoded triples table: the triples live in the B+-tree itself, and only the mapping dictionary (ID, Value) is kept alongside it
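The conversion of a pattern into a range scan can be sketched as follows. A sorted Python list stands in for the leaf level of the clustered B+-tree here, which is an assumption made for brevity; the function name and the integer triples are hypothetical.

```python
import bisect
from typing import List, Tuple

def prefix_range_scan(sorted_triples: List[Tuple[int, ...]],
                      prefix: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Return all triples that start with `prefix`, using one seek plus a scan."""
    # Seek: a tuple prefix sorts before every triple that extends it.
    start = bisect.bisect_left(sorted_triples, prefix)
    k = len(prefix)
    result = []
    # Scan forward until the prefix no longer matches.
    for triple in sorted_triples[start:]:
        if triple[:k] != prefix:
            break
        result.append(triple)
    return result

# ID-encoded triples in (S, P, O) order, sorted lexicographically.
spo = sorted([(0, 1, 2), (0, 3, 4), (4, 1, 5), (0, 1, 6)])

# Pattern (0, 1, ?var): constants form a prefix, the variable is the suffix.
matches = prefix_range_scan(spo, (0, 1))
```

The scan touches only the contiguous run of matching triples, so its cost depends on the size of the result rather than on the size of the whole table.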
Exhaustive Indexing
So far we relied on the fact that the variables form a suffix of the pattern
– (-, -, ?var), (-, ?var1, ?var2)
But what about (?var, -, -)?
– To guarantee that every possible pattern, with variables in any position of the pattern triple, can be answered by a single index scan, we maintain all six possible permutations of S, P, and O in six separate indexes
– (SPO, SOP, OSP, OPS, PSO, POS)
– We can afford this level of redundancy
– On all experimental datasets, the total size of all indexes together is less than the size of the original data
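The following sketch shows why six orderings suffice: for any pattern, some ordering puts the constants first, so the variables become a suffix and a single prefix scan answers the pattern. Sorted lists stand in for the six clustered indexes, `None` marks a variable, and all names and data are hypothetical.

```python
import bisect
from itertools import permutations
from typing import Dict, List, Optional, Tuple

IdTriple = Tuple[int, int, int]
Pattern = Tuple[Optional[int], Optional[int], Optional[int]]

# All six orderings of S, P, O.
ORDERS = ["".join(p) for p in permutations("SPO")]

def build_indexes(triples: List[IdTriple]) -> Dict[str, List[Tuple[int, ...]]]:
    """Build one sorted copy of the triples per ordering."""
    indexes = {}
    for order in ORDERS:
        positions = ["SPO".index(c) for c in order]
        indexes[order] = sorted(tuple(t[i] for i in positions) for t in triples)
    return indexes

def choose_order(pattern: Pattern) -> str:
    """Pick the ordering that places constants before variables."""
    bound = [c for c, v in zip("SPO", pattern) if v is not None]
    free = [c for c, v in zip("SPO", pattern) if v is None]
    return "".join(bound + free)

def lookup(indexes: Dict[str, List[Tuple[int, ...]]], pattern: Pattern) -> List[IdTriple]:
    """Answer a pattern with one prefix scan on the chosen index."""
    order = choose_order(pattern)
    prefix = tuple(v for v in pattern if v is not None)
    index = indexes[order]
    start = bisect.bisect_left(index, prefix)
    out = []
    for row in index[start:]:
        if row[:len(prefix)] != prefix:
            break
        by_name = dict(zip(order, row))
        # Report results back in (S, P, O) order.
        out.append((by_name["S"], by_name["P"], by_name["O"]))
    return out

indexes = build_indexes([(0, 1, 2), (0, 3, 4), (5, 1, 2)])
# Pattern (?var, 1, 2): the variable is in front, so the POS index is used.
result = lookup(indexes, (None, 1, 2))
```

This sketch breaks ties between equally suitable orderings arbitrarily; a real optimizer would also consider which sort order helps later joins.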
Moreover, …
Aggregated Indices
Sometimes we don't need the full triple
– Is there a connection between obj4 and obj13?
– How many authors does object14 have?
Therefore, maintain aggregated indexes with (value1, value2, count)
– (value1, value2) => (SP, PS, SO, OS, PO, OP)
– We can use a clustered B+-tree here as well
Other Features
– Join ordering
– Selectivity estimation
– …
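An aggregated index can be sketched as a sorted list of (value1, value2, count) entries. The predicate name `hasAuthor` and the data below are hypothetical, and a sorted list again stands in for a clustered B+-tree.

```python
from collections import Counter
from typing import List, Tuple

POSITION = {"S": 0, "P": 1, "O": 2}

def build_aggregated_index(triples: List[Tuple[str, str, str]],
                           first: str, second: str) -> List[Tuple[str, str, int]]:
    """Project triples onto two components and count the duplicates."""
    counts = Counter((t[POSITION[first]], t[POSITION[second]]) for t in triples)
    return sorted((v1, v2, n) for (v1, v2), n in counts.items())

# Hypothetical data for the question "How many authors does object14 have?"
triples = [
    ("object14", "hasAuthor", "author1"),
    ("object14", "hasAuthor", "author2"),
    ("object14", "hasTitle", "title1"),
]
sp_index = build_aggregated_index(triples, "S", "P")
author_count = next(n for s, p, n in sp_index
                    if (s, p) == ("object14", "hasAuthor"))
```

The count is read from a single entry without touching the individual triples; a connectivity question such as the obj4/obj13 example would use the SO or OS variant in the same way.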
An Experimental Setup
Setup
– 2 GHz dual core, 2 GB RAM, 30 MB/s disk, Linux
Competitors
– MonetDB: column-store-based (vertically partitioned) approach, presented in VLDB07 by Abadi et al.
– PostgreSQL: triple store with SPO, POS, PSO indexes, similar to Sesame
– Other approaches performed much worse: Jena2, Yars2 (DERI), …
Datasets
– Barton: library data, 51 million triples (4.1 GB)
– Yago: Wikipedia-based ontology, 40 million triples (3.1 GB)
– LibraryThing (partial crawl): users tag books, 30 million triples (1.8 GB)
Benchmark queries (7 or 8 per dataset): see appendix
DB Load Time & DB Size
[Two tables: DB Load Time (min.) and DB Size (GB), each with rows RDF-3X, MonetDB, PostgreSQL and columns Barton, Yago, LibThing. Only the fused MonetDB load-time entry "11214" is legible.]
Slide annotations: "Good", "Bad!", "After running the benchmark"
Query Run-times
Average run-times for warm (cold) cache, in seconds

             Barton         Yago          LibThing
RDF-3X       0.4 (5.9)      0.04 (0.7)    0.13 (0.89)
MonetDB      4.8 (26.4)     54.6 (78.2)   4.39 (8.16)
PostgreSQL   64.3 (167.8)   0.56 (10.6)   30.4 (93.9)
Conclusion
RDF-3X (RDF Triple eXpress) is a fast and flexible RDF/SPARQL engine
– Exhaustive but very space-efficient triple indexes
– Avoids physical design tuning; generic storage
– Fast runtime system; query optimization has a huge impact
RDF-3X is freely available
Paper Evaluation
Pros
– Good idea
– Introduces and solves optimization issues
– Implementation
My Comments
– Real examples about optimization issues
– RISC-style? Most operators merely process integer-encoded IDs, consume and produce streams of ID tuples, compare IDs, etc. ??
– Insert & Update & Delete?
– Namespace