Store RDF Triples In A Scalable Way Liu Long & Liu Chunqiu
Scalable RDF Store Based on HBase and MapReduce Liu Chunqiu
OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion
RDF Schema Example In the example above, the resource "horse" is a subclass of the class "animal". ex:horse rdfs:subClassOf ex:animal
OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion
Physical Designs for RDF Storage (1/4) Giant Triples Table SELECT ?title WHERE { ?book ?title. ?book. ?book } Entire Table Scan! Redundancy!
Physical Designs for RDF Storage (2/4) Clustered Property Table Contains clusters of properties that tend to be defined together
Physical Designs for RDF Storage (3/4) Property-Class Table Exploits the type property of subjects to cluster similar sets of subjects together in the same table Unlike clustered property table, a property may exist in multiple property-class tables Values of the type property
Physical Designs for RDF Storage (4/4) Vertically Partitioned Table The giant table is rewritten into n two column tables where n is the number of unique properties in the data We don’t have to Maintain null values Have a certain clustering algorithm subject property object
OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion
HBase A data row = a sortable rowkey + an arbitrary number of columns Columns are grouped into column families A data cell can hold multiple versions which are distinguished by timestamp HBase is like a multi-dimensional sorted map : (Row Key, Family : Column,Timestamp )-----> Value Row Key: Key1 { Column Family A { Column X: T1 Value1 T2 Value2 Column Y: T3 Value3 } Column Family B { Column Z: T4 Value4 1.Data stored in the same column family are stored together on file system,while might be distributed 2.Hbase provides a B+ tree-like index on row key by default 3.Secondary Indexes on columns can also be created with extra efforts
MapReduce
OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion
Scalable RDF Store Based on HBase and MapReduce Present a RDF storage schema on HBase. The storage schema considers characteristics of both RDF data model and HBase structure Present a MapReduce algorithm for SPARQL Basic Graph Pattern Processing Evaluation
The Storage Schema on HBase build up six tables to store RDF triples : S_PO,P_SO, O_SP, PS_O, SO_P and PO_S Data are stored in row keys and column names These six tables have different row key and column name combinations Row Key: Subject URI I { Column Family: VALUE { Column: (predicate1, object1) Column: (predicate2, object2) Column: (predicate3, object3) } S_PO data row Column value is null !
The Storage Schema on HBase P1: (s, p, o) P2: (?s, p, o) P3: (s,? p, o) P4: (s, p,? o) P5: (?s,? p, o) P6: (s, ?p,? o) P7: (?s, p,? o) P8: (?s,? p,? o) S_PO P_SO O_SP SP_O SO_P PO_S P1 and P8 can be handled by any table
The Storage Schema on HBase Advantages Support for multi-valued properties Support for parallel processing Reduced I/O cost Suitable for HBase Disadvantages Requires more storage space The update operation will be more complicated
Query Processing SELECT ?x ?y ?z WHERE { ?x rdf:type ub:GraduateStudent. ?y rdf:type ub:University. ?z rdf:type ub:Department. ?x ub:memberOf ?z. ?z ub:subOrganizationOf ?y. ?x ub:undergraduateDegreeFrom ?y. } LUBM Q2 ( x, y, z ) x y (x, y)(x, y, z) (y, z) (x, z) z Join on x Join on y Join on (x, y) Join on z A MapReduce job
Process of MapReduce Assign unique ids to all triple patterns Retrieve results for all triple patterns and persist results to file system Calculate expressions for triple patterns and final result Create and submit the MapReduce job Mappers receive input (k, v) pairs Reducer retrieve (k, v) pairs Input of mapper : (1, Department0) (2, (Student0, Department0)) (3, (Department0, University0)) Output Of mapper: (Department0, (1, )) (Department0, (2, Student0)) (Department0, (3, University0)) (a) Output of reducer: (Student0,University0, Department0) (b) ? x ub: memberOf ?z (x,z) ?z rdf:type ub:Department. ?x ub:memberOf ?z. ?z ub:subOrganizationOf ?y.
Process of MapReduce Notice: If the query doesn't need to join, then there will not be any MapReduce job LUBM Query6 SELECT ?X WHERE { ?X rdf:type ub:Student } NO need to join!!!! The result of the only triple pattern will be directly used as final output
Evaluation
UniversitiesQ1(sec)Q2(sec)Q6(sec)Q7(sec)Q9(sec) Q6 is the most efficient query
Evaluation UniversitiesQ1(sec)Q2(sec)Q6(sec)Q7(sec)Q9(sec) Q6 is the most efficient query Q1 is the fastest among rest queries
Evaluation UniversitiesQ1(sec)Q2(sec)Q6(sec)Q7(sec)Q9(sec) Q6 is the most efficient query Q1 is the fastest among rest queries Performance result against LUBM(20, 0) is not quite efficient Compare results of LUBM(20,0)and LUBM(100,0),MapReduce shows great scalability when data size grows
OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion
Conclude and Think This approach works against large RDF dataset Waste a lot of storage space MapReduce algorithm need to be more efficient Enrich storage schema Use data compression Reduce execution steps during query processing Integrate with other MapReduce algorithm SO