Published by Jordan Burns. Modified over 9 years ago.
1
gStore: Answering SPARQL Queries via Subgraph Matching
Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao
{zoulei, mojinghui, zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca
2
Agenda
– Introduction
– Preliminaries
– Overview of gStore
– Storage Scheme and Encoding Technique
– Indexing Structure and Query Algorithm
– Optimized methods
– Experiments and their results
– Conclusions
3
Introduction - 1/4
What is RDF?
– Building block of the Semantic Web
– Represented as a collection of triples: (Subject, Property, Object)
Prefix: y = http://en.wikipedia.org/wiki/

Subject | Property | Object
y:Abraham_Lincoln | hasName | "Abraham Lincoln"
y:Abraham_Lincoln | BornOnDate | "1809-02-12"
y:Abraham_Lincoln | DiedOnDate | "1865-04-15"
y:Abraham_Lincoln | DiedIn | y:Washington_D.C
y:Washington_D.C | hasName | "Washington D.C"
y:Washington_D.C | FoundYear | 1790
y:Washington_D.C | rdf:type | y:city
y:United_States | hasName | "United States"
y:United_States | hasCapital | y:Washington_D.C
y:United_States | rdf:type | Country
y:Reese_Witherspoon | rdf:type | y:Actor
y:Reese_Witherspoon | BornOnDate | "1976-03-22"
y:Reese_Witherspoon | BornIn | y:New_Orleans_Louisiana
y:Reese_Witherspoon | hasName | "Reese Witherspoon"
y:New_Orleans_Louisiana | FoundYear | 1718
y:New_Orleans_Louisiana | rdf:type | y:city
y:New_Orleans_Louisiana | locatedIn | y:United_States
4
Introduction - 2/4: RDF Graph
5
Introduction - 3/4
What is SPARQL?
Sample query:
    SELECT ?name WHERE {
      ?m <hasName> ?name .
      ?m <BornOnDate> "1809-02-12" .
      ?m <DiedOnDate> "1865-04-15" .
    }
Query with wildcards:
    SELECT ?name WHERE {
      ?m <hasName> ?name .
      ?m <BornOnDate> ?bd .
      ?m <DiedOnDate> ?dd .
      FILTER (regex(str(?bd), "02-12") && regex(str(?dd), "04-15"))
    }
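To make the wildcard query concrete, here is a minimal Python sketch that evaluates it by brute force over a few triples from the RDF example. The property names (hasName, BornOnDate, DiedOnDate) are assumed from that triple table, and the helper names are hypothetical. This is the naive scan that does not scale, not the gStore method.

```python
# A handful of triples from the RDF example (subject, property, object).
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Reese_Witherspoon", "hasName", "Reese Witherspoon"),
    ("y:Reese_Witherspoon", "BornOnDate", "1976-03-22"),
]


def objects(subject, prop):
    """Return every object o such that (subject, prop, o) is a stored triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def wildcard_query(born_part, died_part):
    """Names of subjects whose birth and death dates contain the given substrings."""
    names = []
    for m in sorted({s for s, _, _ in TRIPLES}):
        born_ok = any(born_part in bd for bd in objects(m, "BornOnDate"))
        died_ok = any(died_part in dd for dd in objects(m, "DiedOnDate"))
        if born_ok and died_ok:
            names.extend(objects(m, "hasName"))
    return names


print(wildcard_query("02-12", "04-15"))  # prints ['Abraham Lincoln']
```

Every subject is scanned for every query, which is the cost that signature-based filtering is meant to avoid.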
6
Introduction - 4/4
Problems with existing solutions:
– they cannot answer SPARQL queries with wildcards in a scalable manner
– they cannot handle frequent updates in RDF repositories
Answering with subgraph matching
– Modeling RDF data and Query as two graphs
– Cannot use regular graph pattern matching
– Answering SPARQL query ≈ subgraph matching
7
Preliminaries
– RDF graph G is denoted as G = (V, L_V, E, L_E)
– Query graph Q is denoted as Q = (V, L_V, E, L_E)
8
Preliminaries Cont’d
G(u_1, u_2, …, u_n) is a match of Q(v_1, v_2, …, v_n) if, for all i and j:
– if v_i is a literal vertex, v_i and u_i have the same literal value
– if v_i is a class/entity vertex, v_i and u_i have the same URI
– if v_i is a parameter vertex, there is no constraint over u_i
– if v_i is a wildcard vertex, v_i is a substring of u_i and u_i is a literal value
– if there is an edge from v_i to v_j in Q with the property p, there is also an edge from u_i to u_j in G with the same property p
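The match conditions can be checked mechanically. The following Python sketch is one way to do it under simple assumptions: the data layout (dictionaries and edge sets) and all names such as `QVertex` and `is_match` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QVertex:
    kind: str        # "literal", "entity", "parameter", or "wildcard"
    value: str = ""  # literal value, URI, or wildcard substring


def vertex_ok(qv, data_value, data_is_literal):
    """Check one query vertex against one data vertex."""
    if qv.kind == "literal":
        return data_is_literal and qv.value == data_value
    if qv.kind == "entity":
        return (not data_is_literal) and qv.value == data_value
    if qv.kind == "parameter":
        return True  # no constraint on the data vertex
    if qv.kind == "wildcard":
        return data_is_literal and qv.value in data_value
    raise ValueError(f"unknown vertex kind: {qv.kind}")


def is_match(q_vertices, q_edges, assignment, g_vertices, g_edges):
    """assignment maps each query vertex name v_i to a data vertex name u_i."""
    for name, qv in q_vertices.items():
        value, is_literal = g_vertices[assignment[name]]
        if not vertex_ok(qv, value, is_literal):
            return False
    # Every query edge needs a data edge with the same property.
    return all((assignment[a], p, assignment[b]) in g_edges for a, p, b in q_edges)


G_VERTICES = {
    "u1": ("y:Abraham_Lincoln", False),
    "u2": ("Abraham Lincoln", True),
    "u3": ("1809-02-12", True),
}
G_EDGES = {("u1", "hasName", "u2"), ("u1", "BornOnDate", "u3")}
Q_VERTICES = {
    "m": QVertex("parameter"),
    "name": QVertex("parameter"),
    "bd": QVertex("wildcard", "02-12"),
}
Q_EDGES = {("m", "hasName", "name"), ("m", "BornOnDate", "bd")}

GOOD = {"m": "u1", "name": "u2", "bd": "u3"}
BAD = {"m": "u1", "name": "u3", "bd": "u2"}
print(is_match(Q_VERTICES, Q_EDGES, GOOD, G_VERTICES, G_EDGES))  # prints True
print(is_match(Q_VERTICES, Q_EDGES, BAD, G_VERTICES, G_EDGES))   # prints False
```

This checks a single given assignment; finding all assignments is the subgraph matching problem itself.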
9
Overview of gStore
– Work directly on the RDF graph and the SPARQL query graph
– Use a signature-based encoding of each entity and class vertex to speed up matching
– Filter and evaluate: a filtering step that may admit false positives (but no false negatives) prunes nodes and yields a set of candidates; each candidate is then verified
– Use an index (VS*-tree) over the data signature graph for efficient pruning; the index has a light maintenance load
10
Storage Scheme & Encoding Technique Storage Scheme
11
Storage Scheme & Encoding Technique
Encoding technique
(hasName, “Abraham Lincoln”)
hasName → 0100 0000 0000
12
Storage Scheme & Encoding Technique
Encoding technique
(hasName, “Abraham Lincoln”)
hasName → 0100 0000 0000
“Abraham Lincoln” → “Abr”, “bra”, “rah”
13
Storage Scheme & Encoding Technique
Encoding technique
(hasName, “Abraham Lincoln”)
hasName → 0100 0000 0000
“Abr” → 0000 0100 0000 0000
“bra” → 1000 0000 0000 0000
“rah” → 0000 0000 0100 0000
14
Storage Scheme & Encoding Technique
Encoding technique
(hasName, “Abraham Lincoln”)
hasName → 0100 0000 0000
“Abr” → 0000 0100 0000 0000
“bra” → 1000 0000 0000 0000
“rah” → 0000 0000 0100 0000
OR → 1000 0100 0100 0000
15
Storage Scheme & Encoding Technique
Encoding technique
(hasName, “Abraham Lincoln”)
hasName → 0100 0000 0000
“Abraham Lincoln” → 1000 0100 0100 0000
16
Storage Scheme & Encoding Technique
Encoding technique: the vertex signature is the bitwise OR of its edge signatures

Edge | Property bits | Value bits
(hasName, "Abraham Lincoln") | 0010 0000 0000 | 1000 0100 0100 0000
(BornOnDate, "1809-02-12") | 0100 0000 0000 | 0100 0010 0100 1000
(DiedOnDate, "1865-04-15") | 0000 1000 0000 | 0000 0010 0100 0000
(DiedIn, y:Washington_D.C) | 0000 0010 0000 | 1000 0010 0100 0001
OR | 0110 1010 0000 | 1100 0110 0100 1001
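The encoding steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the 12-bit and 16-bit widths follow the bit strings on the slides, but the hash function (CRC32 here) is a stand-in, so the bit positions will differ from the slides. All function names are hypothetical.

```python
import zlib

M_BITS = 12  # width of the property part, as on the slides
N_BITS = 16  # width of the value part, as on the slides


def _bit(text, width):
    """Map a string to a single set bit; CRC32 is an assumed stand-in hash."""
    return 1 << (zlib.crc32(text.encode("utf-8")) % width)


def ngrams(text, n=3):
    """All n-grams of text; a shorter string is treated as one gram."""
    return [text[i:i + n] for i in range(len(text) - n + 1)] or [text]


def edge_signature(prop, value):
    """Property bits followed by the OR of the value's n-gram bits."""
    value_bits = 0
    for gram in ngrams(value):
        value_bits |= _bit(gram, N_BITS)
    return (_bit(prop, M_BITS) << N_BITS) | value_bits


def vertex_signature(edges):
    """Bitwise OR of the signatures of all edges adjacent to the vertex."""
    sig = 0
    for prop, value in edges:
        sig |= edge_signature(prop, value)
    return sig


def covers(query_sig, data_sig):
    """Filter test: every bit set in the query is also set in the data."""
    return query_sig & data_sig == query_sig


LINCOLN_EDGES = [
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.C"),
]
LINCOLN_SIG = vertex_signature(LINCOLN_EDGES)
print(format(LINCOLN_SIG, "028b"))
# A wildcard on the birth date only sets bits that the full date also sets.
print(covers(edge_signature("BornOnDate", "02-12"), LINCOLN_SIG))  # prints True
```

Because every 3-gram of a substring is also a 3-gram of the full string, a wildcard's signature is always covered by the signature of any value that contains it. Hash collisions can make unrelated values pass as well, which is why a verification step is still needed.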
17
Indexing Structure and Query Algorithm
18
Data Signature Graph G*
19
Converting Q to Q*
20
Filter and Evaluate
– Find matches of Q* over G* to obtain the candidate list (CL)
– Verify each candidate match against the RDF graph G to obtain the result set (RS)
21
Generating Candidate List (CL)
Two-step process:
– for each vertex v_i ∈ V(Q*), find a list R_i = {u_i1, u_i2, ..., u_in} such that v_i & u_ij = v_i for every u_ij ∈ R_i, where u_ij ∈ V(G*)
– do a multi-way join over the lists to get the candidate list
Use S-trees
– Height-balanced tree over signatures
– Does not support the second step, so the join is expensive
VS-tree and VS*-tree
– Multi-resolution summary graph based on the S-tree
– Supports both steps efficiently
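The two steps can be sketched without any index, as a plain scan followed by a nested-loop join. The signatures, vertex names, and function names below are made-up illustrations, not data from the slides.

```python
def candidates(query_sig, data_sigs):
    """Step 1: data vertices u with query_sig & sig(u) == query_sig."""
    return [u for u, sig in sorted(data_sigs.items()) if query_sig & sig == query_sig]


def join(r1, r2, data_edges, query_edge_sig):
    """Step 2: keep pairs linked by a data edge whose signature covers the query edge."""
    return [
        (a, b)
        for a in r1
        for b in r2
        if (a, b) in data_edges
        and query_edge_sig & data_edges[(a, b)] == query_edge_sig
    ]


# Hypothetical data signature graph G*: vertex and edge signatures.
DATA_SIGS = {
    "001": 0b0010_1000,
    "002": 0b1000_0100,
    "003": 0b0001_0001,
    "004": 0b0000_1001,
}
DATA_EDGES = {("001", "002"): 0b10010, ("004", "002"): 0b01000}

R1 = candidates(0b0000_1000, DATA_SIGS)
R2 = candidates(0b1000_0000, DATA_SIGS)
CL = join(R1, R2, DATA_EDGES, 0b10000)
print(R1, R2, CL)  # prints ['001', '004'] ['002'] [('001', '002')]
```

The scan in step 1 is what an S-tree accelerates; the join in step 2 is the part an S-tree alone does not help with.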
22
S-tree Solution
[Figure: S-tree over data vertices 001–008, with nodes d_1^3, d_2^3, d_3^3, d_4^3, d_1^2, d_2^2 and a root, each labelled with a bit-string signature. Query signatures shown: 0000 1000, 1000 0000, 10000.]
23
S-tree Solution
[Figure: S-tree over data vertices 001–008, with nodes d_1^3, d_2^3, d_3^3, d_4^3, d_1^2, d_2^2 and a root, each labelled with a bit-string signature. Query signatures shown: 0000 1000, 1000 0000, 10000. Vertex list shown: 001, 004, 006.]
24
S-tree Solution
[Figure: S-tree over data vertices 001–008, with nodes d_1^3, d_2^3, d_3^3, d_4^3, d_1^2, d_2^2 and a root, each labelled with a bit-string signature. Query signatures shown: 0000 1000, 1000 0000, 10000. Vertex lists shown: 001, 004, 006 and 002, 003, 006.]
26
S-tree Solution
[Figure: S-tree over data vertices 001–008, with nodes d_1^3, d_2^3, d_3^3, d_4^3, d_1^2, d_2^2 and a root, each labelled with a bit-string signature. Query signatures shown: 0000 1000, 1000 0000, 10000. Vertex lists shown: 001, 004, 006 and 002, 003, 006, with an “&” marking the join of the two lists.]
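An S-tree lookup can be sketched as follows, assuming the usual signature-tree rule that each internal node stores the bitwise OR of its children's signatures. For brevity each data vertex is its own leaf and the tree is built by hand; the signatures and names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    sig: int
    children: List["Node"] = field(default_factory=list)
    vertex: Optional[str] = None  # set only on leaves


def leaf(vertex, sig):
    return Node(sig=sig, vertex=vertex)


def inner(children):
    """Internal node whose signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.sig
    return Node(sig=sig, children=list(children))


def search(node, query_sig):
    """Collect leaf vertices whose signature covers query_sig, pruning subtrees."""
    if query_sig & node.sig != query_sig:
        return []  # no descendant can cover the query
    if node.vertex is not None:
        return [node.vertex]
    found = []
    for child in node.children:
        found.extend(search(child, query_sig))
    return found


ROOT = inner([
    inner([leaf("001", 0b0010_1000), leaf("002", 0b1000_0100)]),
    inner([leaf("003", 0b0001_0001), leaf("004", 0b0000_1001)]),
])
print(search(ROOT, 0b0000_1000))  # prints ['001', '004']
print(search(ROOT, 0b0100_0000))  # prints []
```

The second query is rejected at the root because no signature in the tree sets that bit, so no leaf is visited.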
27
VS-tree Solution
[Figure: VS-tree over data vertices 001–008, with nodes d_1^3, d_2^3, d_3^3, d_4^3, d_1^2, d_2^2 and root d_1^1. Nodes are labelled with bit-string signatures, and the edges between nodes are labelled with 5-bit signatures.]
28
VS-tree Solution
Query signatures shown: 0000 1000, 1000 0000, 10000
29
VS-tree Solution
Query signatures shown: 0000 1000, 1000 0000, 10000
Summary match: d_1^1 × d_1^1
30
VS-tree Solution
Query signatures shown: 0000 1000, 1000 0000, 10000
Summary match: d_1^2 × d_1^2
31
VS-tree Solution
Query signatures shown: 0000 1000, 1000 0000, 10000
Summary match: d_1^3 × d_2^3
32
VS-tree Solution
Query signatures shown: 0000 1000, 1000 0000, 10000
Match: 001 × 002
33
VS-tree Solution - limitations
Query signatures shown: 0000 1000, 1000 0000, 10000
– If a level is dense, it produces many summary matches, which enlarges the search space
– Each level is processed step by step
34
Possible Optimization Methods
– “Magically” know which level to begin with, to minimize the number of summary matches
– Use DFS (depth-first search) to find the valid child nodes
– While inserting vertices, consider not only the Hamming distance but also the number of super edges introduced
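The insertion idea can be illustrated with a small Python sketch. The weighted-sum cost, the `weight` parameter, and the data layout are assumptions made here for illustration; the slides only say that both quantities should be considered, not how they are combined.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def choose_node(new_sig, new_neighbors, nodes, weight=1.0):
    """Pick the node to insert into.

    nodes maps a node name to (signature, set of nodes it already has a
    super edge to). new_neighbors is the set of nodes holding vertices
    adjacent to the new vertex. The cost is a hypothetical weighted sum.
    """
    def cost(name):
        sig, linked = nodes[name]
        new_super_edges = len(new_neighbors - linked)
        return hamming(new_sig, sig) + weight * new_super_edges

    return min(sorted(nodes), key=cost)


NODES = {
    "d1": (0b0000_1011, {"d3"}),
    "d2": (0b0000_1001, set()),
    "d3": (0b1111_0000, {"d1"}),
}
NEW_SIG = 0b0000_1001
NEW_NEIGHBORS = {"d3"}

print(choose_node(NEW_SIG, NEW_NEIGHBORS, NODES, weight=0.0))  # prints d2
print(choose_node(NEW_SIG, NEW_NEIGHBORS, NODES, weight=2.0))  # prints d1
```

With Hamming distance alone the new vertex goes to d2, whose signature is identical; once new super edges are penalized it goes to d1, which is one bit away but already linked to d3.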
35
Optimization example
36
Experimental results - Exact queries
Dataset: Yago network (20 million triples, size 3.1 GB)
[Chart: results per query for gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM, GRIN]
37
Experimental results - Wildcard queries
[Chart: results per query for gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM, GRIN]
38
Conclusion
This approach:
– Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing
– Was also able to solve the two problems with existing solutions:
  – it answers SPARQL queries with wildcards in a scalable manner
  – it handles frequent and online updates in RDF repositories
39
Questions?