gStore: Answering SPARQL Queries via Subgraph Matching Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao
Agenda Introduction Preliminaries Overview of gStore Storage Scheme and Encoding Technique Indexing Structure and Query Algorithm Optimized methods Experiments and their results Conclusions
Introduction - 1/4 What is RDF? – Building block of the Semantic Web – Represented as a collection of triples: (Subject, Property, Object) Prefix: y=
Subject | Property | Object
y:Abraham Lincoln | hasName | Abraham Lincoln
y:Abraham Lincoln | BornOnDate | “ ”
y:Abraham Lincoln | DiedOnDate | “ ”
y:Abraham Lincoln | DiedIn | y:Washington_D.C
y:Washington_D.C | hasName | “Washington D.C”
y:Washington_D.C | FoundYear | 1790
y:Washington_D.C | rdf:type | y:city
y:United_States | hasName | “United States”
y:United_States | hasCapital | y:Washington_D.C
y:United_States | rdf:type | Country
y:Reese_Witherspoon | rdf:type | y:Actor
y:Reese_Witherspoon | BornOnDate | “ ”
y:Reese_Witherspoon | BornIn | y:New_Orleans_Louisiana
y:Reese_Witherspoon | hasName | “Reese Witherspoon”
y:New_Orleans_Louisiana | FoundYear | 1718
y:New_Orleans_Louisiana | rdf:type | y:city
y:New_Orleans_Louisiana | locatedIn | y:United_States
Introduction - 2/4 RDF Graph
Introduction - 3/4 What is SPARQL?
Sample query: Select ?name Where { ?m ?name. ?m “ ” ?m “ ” }
Query with wildcards: Select ?name Where { ?m ?name. ?m ?bd. ?m ?dd. FILTER regex(str(?bd), “02-12”), regex(str(?dd), “04-15”) }
Introduction - 4/4 Problems with existing solutions: – they cannot answer SPARQL queries with wildcards in a scalable manner – they cannot handle frequent updates in RDF repositories Answering with subgraph matching – Modeling RDF data and Query as two graphs – Cannot use regular graph pattern matching – Answering SPARQL query ≈ subgraph matching
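Preliminaries An RDF graph G is denoted as G = (V, L_V, E, L_E), where V is the set of vertices, L_V the set of vertex labels, E the set of directed edges, and L_E the set of edge labels (properties) A query graph Q is denoted in the same way as Q = (V, L_V, E, L_E)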
Preliminaries Cont’d G(u_1, u_2, …, u_n) is a match of Q(v_1, v_2, …, v_n) if, for every i: – if v_i is a literal vertex, v_i and u_i have the same literal value – if v_i is a class/entity vertex, v_i and u_i have the same URI – if v_i is a parameter vertex, there is no constraint on u_i – if v_i is a wildcard vertex, u_i is a literal value and v_i is a substring of u_i – if there is an edge from v_i to v_j in Q with property p, there is also an edge from u_i to u_j in G with the same property p
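The match conditions can be checked mechanically. The following is a minimal sketch, not gStore's implementation: the `QVertex` class, the dictionary layout of the data graph, and the tiny example graph are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QVertex:
    """Hypothetical query vertex: kind is literal, entity, parameter, or wildcard."""
    kind: str
    value: str = ""


def vertex_ok(qv, label, is_literal):
    """Check one query vertex against one data vertex (label, is_literal)."""
    if qv.kind == "literal":
        return is_literal and label == qv.value
    if qv.kind == "entity":
        return (not is_literal) and label == qv.value
    if qv.kind == "parameter":
        return True  # no constraint on the data vertex
    if qv.kind == "wildcard":
        return is_literal and qv.value in label  # substring test
    raise ValueError(f"unknown vertex kind: {qv.kind}")


def is_match(q_vertices, q_edges, g_vertices, g_edges, assignment):
    """assignment[i] is the data vertex u_i proposed for query vertex v_i."""
    for qv, u in zip(q_vertices, assignment):
        label, is_literal = g_vertices[u]
        if not vertex_ok(qv, label, is_literal):
            return False
    # Every query edge (v_i, p, v_j) needs a data edge (u_i, p, u_j).
    return all((assignment[i], p, assignment[j]) in g_edges
               for i, p, j in q_edges)


# Assumed toy data graph: id -> (label, is_literal), plus labeled edges.
g_vertices = {
    1: ("y:Abraham Lincoln", False),
    2: ("Abraham Lincoln", True),
    3: ("y:Washington_D.C", False),
}
g_edges = {(1, "hasName", 2), (1, "DiedIn", 3)}

q_vertices = [QVertex("parameter"),
              QVertex("wildcard", "Lincoln"),
              QVertex("entity", "y:Washington_D.C")]
q_edges = [(0, "hasName", 1), (0, "DiedIn", 2)]

print(is_match(q_vertices, q_edges, g_vertices, g_edges, [1, 2, 3]))
```

Checking one proposed assignment is cheap; the expensive part is enumerating assignments, which is what the signature-based filtering is meant to cut down.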
Overview of gStore Work directly on the RDF graph and the SPARQL query graph Use a signature-based encoding of each entity and class vertex to speed up matching Filter and evaluate – Use a filtering algorithm that may return false positives to prune nodes and obtain a set of candidates; then verify each candidate Use an index (VS*-tree) over the data signature graph for efficient pruning; the index has a light maintenance load
Storage Scheme & Encoding Technique Storage Scheme
Storage Scheme & Encoding Technique Encoding technique – The string in (hasName, “Abraham Lincoln”) is split into 3-grams: “Abr”, “bra”, “rah”, … – The bit strings of the 3-grams are combined with bitwise OR to encode the string – The signatures of all edges of the vertex, (hasName, “Abraham Lincoln”), (BornOnDate, " "), (DiedOnDate, " "), (DiedIn, y:Washington DC), are combined with bitwise OR into the vertex signature
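The encoding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the bit widths, the use of SHA-256 as the hash, setting one bit per token, and all function names are choices made here, not the parameters used by gStore.

```python
import hashlib

PROP_BITS = 16  # assumed width of the property part of an edge signature
OBJ_BITS = 32   # assumed width of the object-string part


def _bit(token, width):
    """Map a token to one bit position in [0, width) with a stable hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % width


def string_signature(text, n=3, width=OBJ_BITS):
    """OR together one bit per n-gram of text ("Abr", "bra", "rah", ...)."""
    sig = 0
    for i in range(max(len(text) - n + 1, 1)):
        sig |= 1 << _bit(text[i:i + n], width)
    return sig


def edge_signature(prop, obj):
    """Concatenate the property bits with the object-string bits."""
    prop_sig = 1 << _bit(prop, PROP_BITS)
    return (prop_sig << OBJ_BITS) | string_signature(obj)


def vertex_signature(edges):
    """OR together the signatures of all (property, object) edges of a vertex."""
    sig = 0
    for prop, obj in edges:
        sig |= edge_signature(prop, obj)
    return sig


lincoln = vertex_signature([
    ("hasName", "Abraham Lincoln"),
    ("DiedIn", "y:Washington_D.C"),
])
query = edge_signature("hasName", "Abraham Lincoln")
# A vertex can only match if its signature covers every bit of the query.
print(query & lincoln == query)
```

Because signatures are built only with OR, a query signature whose bits are not all present in a data signature rules that vertex out, while a covered signature is only a candidate and still needs verification.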
Indexing Structure and Query Algorithm
Data Signature Graph G*
Converting Q to Q*
Filter and Evaluate Filter: find matches of Q* over G* to obtain the candidate list (CL) Evaluate: verify each candidate against the RDF graph G to obtain the result set (RS)
Generating Candidate List (CL) Two-step process: – for each vertex v_i ∈ V(Q*), find a list R_i = {u_i1, u_i2, ..., u_in}, where u_ij ∈ V(G*) and v_i & u_ij = v_i for every u_ij ∈ R_i – do a multi-way join over the lists R_i to get the candidate list S-tree – Height-balanced tree over signatures – Does not support the second step, so the join is expensive VS-tree and VS*-tree – Multi-resolution summary graph based on the S-tree – Support both steps efficiently
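The two steps can be sketched without any index. This is a deliberately naive version for illustration only: the dictionary layout, the function names, and the brute-force join via `itertools.product` are assumptions, and the join shown here is exactly the expensive part that the VS-tree is meant to avoid.

```python
from itertools import product


def candidates(q_sig, data_sigs):
    """Step 1: keep data vertices whose signature covers the query signature."""
    return [u for u, sig in data_sigs.items() if q_sig & sig == q_sig]


def candidate_list(q_sigs, q_edges, data_sigs, data_edges):
    """Step 2: brute-force multi-way join of the per-vertex lists R_i.

    q_edges holds (i, j, edge_sig); data_edges maps (u, v) to an edge signature.
    """
    lists = [candidates(s, data_sigs) for s in q_sigs]
    result = []
    for combo in product(*lists):
        ok = True
        for i, j, e_sig in q_edges:
            d_sig = data_edges.get((combo[i], combo[j]))
            if d_sig is None or e_sig & d_sig != e_sig:
                ok = False
                break
        if ok:
            result.append(combo)
    return result


# Assumed toy signature graph G* and query Q*.
data_sigs = {"d1": 0b1011, "d2": 0b0110, "d3": 0b1110}
data_edges = {("d1", "d2"): 0b01, ("d3", "d2"): 0b10}
q_sigs = [0b1010, 0b0100]
q_edges = [(0, 1, 0b01)]

print(candidate_list(q_sigs, q_edges, data_sigs, data_edges))
```

The join cost grows with the product of the list sizes, which is why pruning during the search matters more than the per-vertex filter alone.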
S-tree Solution (figure: S-tree with nodes d13, d23, d33, d43, d12, d22)
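An S-tree lookup can be sketched as a top-down traversal in which each inner node stores the OR of its children's signatures, so a subtree is skipped as soon as its signature does not cover the query. The `Node` class, the hand-built tree, and the four leaf signatures below are assumptions for illustration; tree construction and balancing are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    sig: int
    children: List["Node"] = field(default_factory=list)
    vertex: Optional[str] = None  # set only on leaves


def make_inner(children):
    """An inner node's signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.sig
    return Node(sig, list(children))


def search(node, q_sig):
    """Return the leaf vertices whose signatures cover q_sig."""
    if (node.sig & q_sig) != q_sig:
        return []  # nothing below this node can match
    if not node.children:
        return [node.vertex]
    hits = []
    for child in node.children:
        hits.extend(search(child, q_sig))
    return hits


leaves = [Node(0b0011, vertex="a"), Node(0b0101, vertex="b"),
          Node(0b1100, vertex="c"), Node(0b1010, vertex="d")]
root = make_inner([make_inner(leaves[:2]), make_inner(leaves[2:])])

print(search(root, 0b0100))
print(search(root, 0b0001))
```

This covers only the per-vertex lookup. The tree carries no edge information, so the join between the resulting lists still has to be done separately.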
VS-tree Solution (figure: VS-tree with nodes d11, d12, d22, d13, d23, d33, d43)
VS-tree Solution - limitations The search processes the tree level by level If a level is dense, it produces many summary matches, which enlarges the search space
Possible Optimization Methods – “Magically” know which level to begin with, to minimize the number of summary matches – Use DFS (depth-first search) to find the valid child nodes – When inserting vertices, consider not only the Hamming distance but also the number of super edges introduced
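The insertion criterion in the last point can be sketched as a cost function. This is a hypothetical illustration: the linear combination with a `weight` parameter, the function names, and the data layout are assumptions made here, since the slides only say that both quantities are considered.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def choose_node(new_sig, neighbor_nodes, node_sigs, super_edges, weight=1):
    """Pick the tree node for a new vertex.

    neighbor_nodes: ids of the nodes holding the new vertex's out-neighbors.
    super_edges: set of (from_node, to_node) pairs that already exist.
    The cost is Hamming distance plus weight times the new super edges;
    this combination is an assumption of the sketch.
    """
    def cost(k):
        added = sum(1 for n in neighbor_nodes if (k, n) not in super_edges)
        return hamming(new_sig, node_sigs[k]) + weight * added

    return min(node_sigs, key=cost)


node_sigs = {"A": 0b1100, "B": 0b1110}
super_edges = {("B", "C")}

# Hamming distance alone prefers A; counting super edges prefers B.
print(choose_node(0b1100, {"C"}, node_sigs, super_edges, weight=0))
print(choose_node(0b1100, {"C"}, node_sigs, super_edges, weight=2))
```

With `weight=0` the choice reduces to plain nearest-signature insertion; a larger weight trades signature similarity for a sparser summary graph.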
Optimization example
Experimental results - Exact queries Yago network (20 million triples, 3.1 GB) (table columns: Queries, gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM, GRIN)
Experimental results - Wildcard queries (table columns: Queries, gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM, GRIN)
Conclusion This approach: – Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing – Is also able to solve the two problems with existing solutions: it answers SPARQL queries with wildcards in a scalable manner and handles frequent, online updates in RDF repositories
Questions?