gStore: Answering SPARQL Queries via Subgraph Matching
Lei Zou 1, Jinghui Mo 1, Lei Chen 2, M. Tamer Özsu 3, Dongyan Zhao 1
1 Peking University, 2 Hong Kong University of Science and Technology, 3 University of Waterloo
Outline
Background & Related Work; Overview of gStore; Encoding Technique; VS*-tree & Query Algorithm; Experiments; Conclusions
Semantic Web
“Semantic Web Technologies” is a collection of standard technologies to realize a Web of Data.
RDF Data Model
[Figure: example RDF triples; labels: URI, Literals]
RDF Graph
[Figure: example RDF graph; legend: Entity Vertex, Literal Vertex]
SPARQL Queries
SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “ ”. ?m <DiedOnDate> “ ”. }
[Figure: the corresponding query graph]
Subgraph Match vs. SPARQL Queries
Naïve Triple Store
SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “ ”. ?m <DiedOnDate> “ ”. }
SQL: Select T3.Object From T as T1, T as T2, T as T3 Where T1.Predicate = “BornOnDate” and T1.Object = “ ” and T2.Predicate = “DiedOnDate” and T2.Object = “ ” and T3.Predicate = “hasName” and T1.Subject = T2.Subject and T2.Subject = T3.Subject
Too many self-joins
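The self-join pattern on this slide can be tried out with a minimal sketch using Python's sqlite3 module. The table name T follows the slide; the column name Predicate, the sample triples, the entity y:Other_Person, and the two date literals are assumptions for illustration, since the slide leaves the literal values blank. The query returns the object of the hasName triple, because that is where the name is stored.

```python
import sqlite3

# Illustrative triples only; the literal values are assumptions, not taken from the slides.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Other_Person", "hasName", "Other Person"),
    ("y:Other_Person", "BornOnDate", "1809-02-12"),
]


def names_born_and_died(born, died):
    """Answer the example query over a single triple table with two self-joins."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (Subject TEXT, Predicate TEXT, Object TEXT)")
    conn.executemany("INSERT INTO T VALUES (?, ?, ?)", TRIPLES)
    rows = conn.execute(
        """
        SELECT T3.Object
        FROM T AS T1, T AS T2, T AS T3
        WHERE T1.Predicate = 'BornOnDate' AND T1.Object = ?
          AND T2.Predicate = 'DiedOnDate' AND T2.Object = ?
          AND T3.Predicate = 'hasName'
          AND T1.Subject = T2.Subject AND T2.Subject = T3.Subject
        """,
        (born, died),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(names_born_and_died("1809-02-12", "1865-04-15"))  # ['Abraham Lincoln']
```

Every additional triple pattern in the SPARQL query adds one more alias of T to the join, which is the cost the slide points at.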
Existing Solutions
Three categories of solutions have been proposed to speed up query processing:
1. Property table: Jena [K. Wilkinson et al. SWDB 03], …
2. Vertically partitioned solution: SW-store [D. J. Abadi et al. VLDB 07], …
3. Exhaustive indexing: RDF-3x [T. Neumann et al. VLDB 08], Hexastore [C. Weiss et al. VLDB 08], …
Existing Solutions - Property Table
SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “ ”. ?m <DiedOnDate> “ ”. }
SQL: Select People.hasName From People Where People.BornOnDate = “ ” and People.DiedOnDate = “ ”
Reduces the number of join steps
Existing Solutions - Vertically Partitioned Solution
Fast merge join
Existing Solutions - Exhaustive Indexing
Each SPARQL query statement can be translated into one “range query”.
SPARQL Query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> “ ”. ?m <DiedOnDate> “ ”. }
Range query & merge join
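As a rough illustration of "range query & merge join", the sketch below keeps one sorted permutation of the triples and answers a pattern with a bound predicate and object as one contiguous range. It is not the RDF-3x or Hexastore implementation; the names POS_INDEX, subjects_with, and merge_join, as well as the sample triples and date literals, are assumptions made for this example.

```python
from bisect import bisect_left

# Illustrative triples (subject, predicate, object); the literal values are assumptions.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Other_Person", "BornOnDate", "1809-02-12"),
]

# One sorted permutation (predicate, object, subject); exhaustive indexing keeps several orders.
POS_INDEX = sorted((p, o, s) for s, p, o in TRIPLES)


def subjects_with(predicate, obj):
    """Subjects matching (?s, predicate, obj), read from one contiguous index range."""
    start = bisect_left(POS_INDEX, (predicate, obj))
    subjects = []
    for p, o, s in POS_INDEX[start:]:
        if (p, o) != (predicate, obj):
            break
        subjects.append(s)
    return subjects  # already sorted, because the index is sorted


def merge_join(left, right):
    """Intersect two sorted, duplicate-free lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


born = subjects_with("BornOnDate", "1809-02-12")
died = subjects_with("DiedOnDate", "1865-04-15")
print(merge_join(born, died))  # ['y:Abraham_Lincoln']
```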
Some Limitations
1. Difficult to handle “wildcard queries”.
2. Difficult to handle updates.
Intuition of gStore
Finding matches over a large graph is not a trivial task.
Preliminaries
[Figure legend: Entity Vertex, Literal Vertex]
Preliminaries - RDF graph
Preliminaries - Query graph
Preliminaries - Match
Preliminaries - Problem definition
Storage Schema in gStore
Encoding all neighbors of a vertex into a “bit-string”, called a signature.
Encoding Technique (1)
|eSig(e).e| = M. We employ m different string hash functions H_i (i = 1, ..., m). For each hash function H_i, we set the (H_i(eLabel) MOD M)-th bit in eSig(e).e to ‘1’.
Encoding eSig(e).n is the same: |eSig(e).n| = N, with n different hash functions.
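The bit-setting rule for the edge-label part can be sketched as follows. The slides do not say which string hash functions are used, so the helper _hash (SHA-256 salted with the function index) is a hypothetical stand-in, and the signature is represented as a Python integer rather than a packed bit array.

```python
import hashlib


def _hash(label: str, i: int) -> int:
    """Hypothetical stand-in for the i-th string hash function H_i."""
    digest = hashlib.sha256(f"{i}:{label}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def encode_label(label: str, length: int, num_hashes: int) -> int:
    """Build a `length`-bit signature by setting bit (H_i(label) MOD length) for each H_i."""
    sig = 0
    for i in range(1, num_hashes + 1):
        sig |= 1 << (_hash(label, i) % length)
    return sig


# M = 12 bits and m = 3 hash functions are arbitrary values chosen for the example.
sig = encode_label("hasName", length=12, num_hashes=3)
print(format(sig, "012b"))
```

Because two hash functions may hit the same position, a signature has at most m bits set, not exactly m.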
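Encoding Technique (2)
Example: the literal “Abraham Lincoln” is split into 3-grams “Abr”, “bra”, “rah”, “aha”, …, which are hashed into the signature of the edge (hasName, “Abraham Lincoln”).
The signatures of the adjacent edges (hasName, “Abraham Lincoln”), (BornOnDate, “ ”), (DiedOnDate, “ ”), and (DiedIn, “y:Washington_D.c”) are combined by bitwise OR into the vertex signature.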
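The n-gram step and the OR step can be illustrated with a short sketch. The hash bit_for, the bit lengths (12 and 16), and the layout that places the label bits above the literal bits are assumptions for this example; treating “y:Washington_D.c” as a plain string is also a simplification. The two date edges are left out because their values are blank on the slide.

```python
import hashlib


def ngrams(text: str, n: int = 3) -> list:
    """All contiguous substrings of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bit_for(token: str, length: int) -> int:
    """Hypothetical hash: map a token to a single bit of a `length`-bit string."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:8], "big") % length)


def edge_signature(predicate: str, literal: str, m_bits: int = 12, n_bits: int = 16) -> int:
    """Label bits in the high part, n-gram bits of the literal in the low part."""
    literal_part = 0
    for gram in ngrams(literal):
        literal_part |= bit_for(gram, n_bits)
    return (bit_for(predicate, m_bits) << n_bits) | literal_part


def vertex_signature(edges) -> int:
    """Bitwise OR of the signatures of all adjacent edges."""
    sig = 0
    for predicate, literal in edges:
        sig |= edge_signature(predicate, literal)
    return sig


edges = [("hasName", "Abraham Lincoln"), ("DiedIn", "y:Washington_D.c")]
print(ngrams("Abraham Lincoln")[:4])  # ['Abr', 'bra', 'rah', 'aha']
print(format(vertex_signature(edges), "028b"))
```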
Encoding Technique (3)
Encoding Technique (4)
1. Find matches over the signature graph G*.
2. Verify each match in the RDF graph G.
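These two steps follow a filter-and-verify pattern, sketched below for a single query vertex. The sketch assumes the usual containment test for OR-ed bit-strings (every bit set in the query signature must also be set in the data signature), which the slide does not spell out. The signatures and edge sets are hand-written toy values, and filter_then_verify is a hypothetical helper name.

```python
def may_match(query_sig: int, data_sig: int) -> bool:
    """Signature filter: every bit set in the query must be set in the data vertex."""
    return (query_sig & data_sig) == query_sig


def filter_then_verify(query_sig, query_edges, signatures, edges):
    """Return (candidates, matches) for one query vertex."""
    # Step 1: cheap filtering on the signature graph G*.
    candidates = [v for v, sig in signatures.items() if may_match(query_sig, sig)]
    # Step 2: exact verification against the RDF graph G removes false positives.
    matches = [v for v in candidates if query_edges <= edges[v]]
    return candidates, matches


# Toy data (hypothetical): 4-bit signatures and the exact edge sets they summarize.
signatures = {"v1": 0b1011, "v2": 0b0110, "v3": 0b1111}
edges = {
    "v1": {("hasName", "Abraham Lincoln")},
    "v2": {("hasName", "Someone Else")},
    "v3": {("hasName", "Another Person")},
}
candidates, matches = filter_then_verify(
    0b0011, {("hasName", "Abraham Lincoln")}, signatures, edges
)
print(candidates)  # ['v1', 'v3']
print(matches)     # ['v1']
```

Here v3 passes the filter only because its signature has all bits set, which is why the verification step in G is still needed.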
Encoding Technique (5)
A Straightforward Solution (1)
[Figure: query vertices u1, u2 and their candidate lists L1, L2]
A Straightforward Solution (2)
Large join space!
[Figure: candidate lists L1, L2]
VS-tree
VS-tree - Query definition
Pruning Technique
Reduced join space!
[Figure: query vertices u1, u2]
Query Algorithm - Top-Down
Optimized Method
Too many super edges; which level to start the search at; no brute-force enumeration.
VS*-tree Insertion
The insertion criterion in the VS-tree depends only on the Hamming distance between the signature of u and that of the node in the VS-tree.
The criterion in the VS*-tree depends on both the node signatures and the structure of G*.
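The Hamming-distance criterion of the plain VS-tree can be sketched as below. This covers only the signature part; the additional dependence on the structure of G* used by the VS*-tree is not specified on the slide and is not modeled here. The function names and the 4-bit signatures are assumptions for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def choose_child(entry_sig: int, child_sigs: list) -> int:
    """Index of the child whose signature is nearest to the new entry."""
    return min(range(len(child_sigs)), key=lambda i: hamming(entry_sig, child_sigs[i]))


children = [0b0101, 0b1011, 0b0000]
print(hamming(0b1010, 0b0110))         # 2
print(choose_child(0b1010, children))  # 1
```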
Updates - Insertion in G*
Updates - Insertion in VS*-tree
VS*-tree Split
The B+1 entries of the node are partitioned into two new nodes, where B is the maximal fanout of a node in the VS*-tree.
1. We find the two entries with the maximal Hamming distance between them and use them as the two seed nodes.
2. We associate each remaining entry with the nearest seed node, according to Equation 1.
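The two split steps can be sketched as follows. Equation 1 is not reproduced on the slides, so plain Hamming distance is used as a stand-in for "nearest"; this is an assumption, as are the function names and the toy signatures.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def split_node(signatures: list):
    """Partition entry indices into two groups around the two most distant entries."""
    # Step 1: pick the pair with the maximal Hamming distance as seeds.
    seed_a, seed_b = max(
        combinations(range(len(signatures)), 2),
        key=lambda pair: hamming(signatures[pair[0]], signatures[pair[1]]),
    )
    group_a, group_b = [seed_a], [seed_b]
    # Step 2: attach every remaining entry to the nearer seed (ties go to the first seed).
    for i, sig in enumerate(signatures):
        if i in (seed_a, seed_b):
            continue
        if hamming(sig, signatures[seed_a]) <= hamming(sig, signatures[seed_b]):
            group_a.append(i)
        else:
            group_b.append(i)
    return group_a, group_b


# B + 1 = 5 overflowing entries with 4-bit signatures (toy values).
overflow = [0b0000, 0b0001, 0b1111, 0b0111, 0b0011]
print(split_node(overflow))  # ([0, 1, 4], [2, 3])
```

This simple assignment does not balance the two groups; a real index would also have to respect the minimal fanout b.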
VS*-tree Deletion
Similar to split: if some node d has fewer than b entries, where b is the minimal fanout of a node in the VS*-tree, then d is deleted and its entries are reinserted into the VS*-tree.
Updates - Deletion in VS*-tree
[Figure: VS*-tree with the entry to be deleted marked]
Which Level To Begin
We introduce the concept of the “pruning power” of G^I with regard to Q*, denoted as P(Q*, G^I).
Estimate P(Q*, G^I)
Finding Valid Child States
We propose a DFS strategy to find all valid child states of J: start a DFS over G* beginning from some vertex v_i.
Datasets
Dataset   Triple #     Size
Yago      20 million   3.1 GB
DBLP      8 million    0.8 GB
Offline Performance
Exact Queries
Wildcard Queries
Conclusions
Vertex Encoding Technique; An Efficient Index Structure: VS-tree; A Novel Filtering Technique.