Published by Barrie Gibson. Modified over 9 years ago.
1
RDF-3X: a RISC-style Engine for RDF
Presented by Thomas Neumann and Gerhard Weikum, Max-Planck-Institut für Informatik, Saarbrücken, Germany
Session 19: System Centric Optimization, VLDB 2008
2009-02-05
Summarized by Jaeseok Myung, Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
2
Copyright 2009 by CEBT
Overview
Goal
  – Build a new type of triple store: RDF-3X
  – Compare RDF-3X with traditional triple stores
This presentation focuses on the physical storage design, which shaped the implementation of the entire system.
Center for E-Business Technology
3
Introduction
RDF: Resource Description Framework
  – Conceptually a labeled graph
  – All data items are represented as (subject, predicate, object) triples, also known as (subject, property, value)
  – RDF data can be seen as a (potentially huge) set of triples

S     P     O
S1    P1    O1
S1    P2    O2
…     …     …

2009 IDS Lab. Winter Seminar
4
Introduction
SPARQL: SPARQL Protocol and RDF Query Language
  – The official standard for querying RDF stores
  – Example: retrieve the titles of all movies with Johnny Depp
  – SPARQL queries are pattern-matching queries over the triples held in the RDF store
  – Each pattern consists of S, P, and O, and each of these is either a variable or a literal
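The pattern-matching view of SPARQL can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy triples, the predicate names `hasActor` and `title`, and the helper names are hypothetical and do not come from the slides or from RDF-3X.

```python
# Hypothetical toy data; the subject, predicate, and object names are illustrative only.
TRIPLES = {
    ("movie1", "title", "Movie One"),
    ("movie1", "hasActor", "Johnny Depp"),
    ("movie2", "title", "Movie Two"),
    ("movie2", "hasActor", "Someone Else"),
}


def is_var(term):
    """A term starting with '?' is a variable; anything else is a constant."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate(patterns, triples):
    """Naive evaluation: join the patterns one by one through shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


QUERY = [("?m", "hasActor", "Johnny Depp"), ("?m", "title", "?title")]
titles = sorted(b["?title"] for b in evaluate(QUERY, TRIPLES))
print(titles)
```

Each pattern here scans every triple, which is the cost that the storage designs on the following slides try to avoid.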
5
Physical Designs for RDF Storage (1/4)
Giant Triples Table
  SELECT ?title WHERE { ?book ?title. ?book. ?book }
  [The constant terms of the three patterns are missing from this transcript.]
Annotations on the slide: Join! Join! Entire Table Scan! Redundancy!
6
Physical Designs for RDF Storage (2/4)
Clustered Property Table
  – Contains clusters of properties that tend to be defined together
7
Physical Designs for RDF Storage (3/4)
Property-Class Table
  – Exploits the type property of subjects to cluster similar sets of subjects together in the same table
  – Unlike the clustered property table, a property may exist in multiple property-class tables
  [Figure label: values of the type property]
8
Physical Designs for RDF Storage (4/4)
Vertically Partitioned Table
  – The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data
  – We don’t have to maintain null values or run a clustering algorithm
  [Figure columns: subject, property, object]
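A minimal sketch of vertical partitioning, assuming the triples fit in memory as Python tuples. The function name and the third example row are hypothetical; the first two rows reuse the example values that appear on the mapping-dictionary slide.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Split one (subject, property, object) table into one
    (subject, object) table per distinct property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    for rows in tables.values():
        rows.sort()  # keep each two-column table ordered by subject
    return dict(tables)


triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
    ("object352", "hasColor", "red"),  # hypothetical extra row
]
tables = vertically_partition(triples)
for prop, rows in tables.items():
    print(prop, rows)
```

No table contains nulls, because a subject only appears in the tables of the properties it actually has.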
9
RDF-3X
Technical Challenges
  – The diversity of predicate names poses a major problem for physical database design: joins, redundancy, ...
RDF-3X (RDF Triple eXpress)
  – A novel architecture for RDF indexing and querying that eliminates the need for physical database design
10
Mapping Dictionary
  – All literals are replaced by unique IDs using a mapping dictionary
  – RDF-3X is based on a single “giant triples table”, but the mapping dictionary compresses the triple store: less redundancy and a large saving in physical space

Original triples
S            P            O
object214    hasColor     blue
object214    belongsTo    object352
…            …            …

Encoded triples
S    P    O
0    1    2
0    3    4
…    …    …

Mapping dictionary
ID    Value
0     object214
1     hasColor
…     …
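The dictionary encoding can be sketched as follows. This is an illustrative reconstruction, not RDF-3X code: IDs are assumed to be handed out in order of first appearance, which happens to reproduce the IDs shown on the slide.

```python
def encode(triples):
    """Replace every term by an integer ID, assigned in order of first appearance.

    Returns the encoded triples and the ID -> value dictionary.
    """
    ids = {}
    encoded = []
    for triple in triples:
        # len(ids) is evaluated before the insert, so a new term gets the next free ID.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    values = {i: term for term, i in ids.items()}
    return encoded, values


encoded, values = encode([
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
])
print(encoded)
print(values)
```

Repeated strings such as `object214` are stored once in the dictionary, and the triples themselves shrink to small integer tuples.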
11
Clustered B+-Tree
  – Store everything in a clustered B+-tree
  – Triples are sorted in lexicographical order, so SPARQL patterns can be converted into range scans
  – No scan of the entire table is needed
  [Figure: a B+-tree over the ID triples (S P O: 0 1 2; 0 3 4; …) alongside the mapping dictionary (ID, Value: 0 object214; 1 hasColor; …). The separate triples table is annotated “Actually, we don’t need this table!”]
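Why lexicographic order turns a pattern into a range scan can be shown with a sorted Python list standing in for the leaf level of the B+-tree. That substitution is an assumption made for brevity, and the last two example triples are hypothetical.

```python
from bisect import bisect_left


def prefix_scan(sorted_triples, prefix):
    """Return all triples that start with `prefix`, reading one contiguous range."""
    start = bisect_left(sorted_triples, prefix)  # first entry >= prefix
    result = []
    for triple in sorted_triples[start:]:
        if triple[:len(prefix)] != prefix:
            break  # left the range, nothing further can match
        result.append(triple)
    return result


# (0, 1, 2) and (0, 3, 4) are the slide's example; the other two are hypothetical.
spo = sorted([(0, 1, 2), (0, 3, 4), (4, 1, 5), (0, 1, 6)])

print(prefix_scan(spo, (0, 1)))  # pattern (0, 1, ?var)
print(prefix_scan(spo, (0,)))    # pattern (0, ?var1, ?var2)
```

The scan touches only the matching entries plus one more, regardless of how many triples the index holds.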
12
Exhaustive Indexing
  – So far we relied on the variables forming a suffix of the pattern: (-, -, ?var) or (-, ?var1, ?var2)
  – But what about (?var, -, -)?
  – To guarantee that every possible pattern, with variables in any position, can be answered by a single index scan, we maintain all six permutations of S, P, and O in six separate indexes: SPO, SOP, OSP, OPS, PSO, POS
  – We can afford this level of redundancy: on all experimental datasets, the total size of all indexes together is less than the original data
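The idea of picking the right permutation per pattern can be sketched like this. Sorted lists again stand in for the six clustered B+-trees, and the function names and the third example triple are hypothetical.

```python
from bisect import bisect_left
from itertools import permutations

ORDERS = ["".join(p) for p in permutations("SPO")]  # all six orderings


def build_indexes(triples):
    """Store the same triples six times, each sorted in a different column order."""
    indexes = {}
    for order in ORDERS:
        positions = ["SPO".index(c) for c in order]
        indexes[order] = sorted(tuple(t[i] for i in positions) for t in triples)
    return indexes


def lookup(indexes, s=None, p=None, o=None):
    """Answer one triple pattern (None means variable) with a single range scan."""
    given = {"S": s, "P": p, "O": o}
    bound = [c for c in "SPO" if given[c] is not None]
    free = [c for c in "SPO" if given[c] is None]
    order = "".join(bound + free)  # constants first, so the variables form a suffix
    prefix = tuple(given[c] for c in bound)
    index = indexes[order]
    rows = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break
        rows.append(dict(zip(order, row)))
    return order, rows


indexes = build_indexes([(0, 1, 2), (0, 3, 4), (5, 1, 2)])
order, rows = lookup(indexes, o=2)  # pattern (?var1, ?var2, 2)
print(order, rows)
```

Whatever positions the variables occupy, one of the six orderings places all constants in front, so the pattern always becomes a prefix scan.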
13
Moreover, …
Aggregated Indices
  – Sometimes we don’t need the full triple
    – Is there a connection between obj4 and obj13?
    – How many authors does object14 have?
  – Therefore, maintain aggregated indexes of the form (value1, value2, count)
    – (value1, value2) ranges over SP, PS, SO, OS, PO, OP
    – These can also be stored in clustered B+-trees
Other Features
  – Join ordering
  – Selectivity estimation
  – …
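An aggregated index of the form (value1, value2, count) could be built along these lines. The triples, the predicate names `hasAuthor` and `linksTo`, and the function name are hypothetical and only illustrate the two questions on the slide.

```python
from collections import Counter

POSITION = {"S": 0, "P": 1, "O": 2}


def aggregated_index(triples, first, second):
    """Project triples onto two columns and count how often each pair occurs."""
    counts = Counter((t[POSITION[first]], t[POSITION[second]]) for t in triples)
    return sorted((a, b, n) for (a, b), n in counts.items())


# Hypothetical data illustrating the two questions above.
triples = [
    ("object14", "hasAuthor", "author1"),
    ("object14", "hasAuthor", "author2"),
    ("object4", "linksTo", "object13"),
]

sp = aggregated_index(triples, "S", "P")
so = aggregated_index(triples, "S", "O")

print(sp)  # the count for (object14, hasAuthor) answers "how many authors?"
print(so)  # an entry for (object4, object13) answers "is there a connection?"
```

Both questions are answered without reading the third component of any triple.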
14
An Experimental Setup
Setup
  – 2 GHz dual core, 2 GB RAM, 30 MB/s disk, Linux
Competitors
  – MonetDB: column-store-based (vertically partitioned) approach, presented at VLDB 2007 by Abadi et al.
  – PostgreSQL: triple store with SPO, POS, PSO indexes, similar to Sesame
  – Other approaches performed much worse: Jena2, Yars2 (DERI), …
Datasets
  – Barton: library data, 51 million triples (4.1 GB)
  – Yago: Wikipedia-based ontology, 40 million triples (3.1 GB)
  – LibraryThing (partial crawl): users tag books, 30 million triples (1.8 GB)
Benchmark queries: 7 or 8 per dataset (see appendix)
15
DB Load Time & DB Size

DB Load Time (min.)
              Barton    Yago    LibThing
RDF-3X        13        25      20
MonetDB       11        21      4
PostgreSQL    30        25      20

DB Size (GB)
              Barton    Yago    LibThing
RDF-3X        2.8       2.7     1.6
MonetDB       1.6       1.1     0.7
PostgreSQL    8.7       7.5     5.7

Annotations on the slide: “Good”, “Bad!”, and “After running the benchmark: 2.0, 2.4, 6.9”
16
Query Run-times

Average run-times for warm (cold) cache (sec.)
              Barton          Yago           LibThing
RDF-3X        0.4 (5.9)       0.04 (0.7)     0.13 (0.89)
MonetDB       4.8 (26.4)      54.6 (78.2)    4.39 (8.16)
PostgreSQL    64.3 (167.8)    0.56 (10.6)    30.4 (93.9)
17
Conclusion
  – RDF-3X (RDF Triple eXpress) is a fast and flexible RDF/SPARQL engine
  – Exhaustive but very space-efficient triple indexes
  – Generic storage that avoids physical design tuning
  – Fast runtime system; query optimization has a huge impact
  – RDF-3X is freely available: http://www.mpi-inf.mpg.de/~neumann/rdf3x
18
Paper Evaluation
Pros
  – Good idea
  – Introduces and solves optimization issues
  – Implementation
My Comments
  – Real examples of optimization issues
  – RISC-style? Most operators merely process integer-encoded IDs, consume and produce streams of ID tuples, compare IDs, etc.
  – Insert, update, and delete?
  – Namespace