Download presentation
Presentation is loading. Please wait.
Published byEmil Cox Modified over 9 years ago
1
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Scalable SPARQL Querying of Large RDF Graphs Xu Bo 2012.06.11 In PVLDB, 4(21), 2011
2
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cnOutline About Presenter Semantic Web Previous Work New Problem SYSTEM ARCHITECTURE EXPERIMENTS CONCLUSIONS AND FUTURE WORK 2015-8-2 2
3
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn About Presenter Daniel Abadi Associate Professor of Computer Science in Yale University Research – Column-Oriented Database Systems – Petascale Parallel Database Systems (HadoopDB) – Semantic Web Data Management 2015-8-2 3
4
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Semantic Web The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web 2015-8-2 4
5
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Google Knowledge Graph 2015-8-2 5
6
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Key Technology HTML XML 2015-8-2 6
7
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn The Disadvantage of XML David Billington is a lecturer of Discrete Mathematics. there is no standard way of assigning meaning to tag nesting 2015-8-2 7
8
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn The Disadvantage of Xpath Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember XML is semantically unsatisfactory 2015-8-2 8
9
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn RDF Resource Description Framework 用 Web 标识符(称作统一资源标识符, Uniform Resource Identifiers 或 URIs )来标识事物,用简单的 属性( property )及属性值来描述资源 2015-8-2 9
10
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn RDF as Triples and a Graph 2015-8-2 10
11
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn SPARQL RDF query language A basic graph pattern Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern 2015-8-2 11
12
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Example for Star Pattern Find the names of the strikers that play for FC Barcelona. 2015-8-2 12
13
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Another Example Find football players playing for clubs in a populous region where they were born. 2015-8-2 13
14
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn 2015-8-2 14
15
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Previous Work RDF In RDBMSs Property Tables Vertically Partitioned Approach 2015-8-2 15
16
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn RDF In RDBMSs Get the title of the book(s) Joe Fox wrote in 2001 2015-8-2 16
17
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Property Tables 2015-8-2 17
18
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Vertically Partitioned Approach 2015-8-2 18
19
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn New Problem Single node RDF management systems are abundant –Sesame –Jena –RDF3X –3store Research in clustered RDF management is less significantly explored: The focus of the talk 2015-8-2 19
20
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn SYSTEM ARCHITECTURE 2015-8-2 20
21
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Graph Partitioning Hash vs. Graph partitioning –Hash: Only efficient for star patterns –Graph: Taking advantage of graph model 2015-8-2 21
22
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Graph Partitioning Edge vs. Vertex partitioning –Edge: Natural but inefficient for query execution –Vertex: Superior for common graph patterns 2015-8-2 22
23
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Vertex Partitioning Preprocess –remove triples whose predicate is rdf:type METIS partitioner 2015-8-2 23
24
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Triple Placement Minimizing data shuffling/exchange –Allowing data overlap N-hop guarantee –The extent of data overlap –If a vertex is assigned to a machine, any vertex that is within n-hop of this vertex is also stored in this machine 2015-8-2 24
25
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn DIRECTED N-HOP GUARANTEE 2015-8-2 25
26
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn A potential problem triples (s, p, o) and (o, p’, o’) –2-hop guarantee triples (s, p, o) and (s’, p’, o) –not guaranteed “object-connected” is not unusual undirected n-hop guarantee 2015-8-2 26
27
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Triple Placement Algorithm 2015-8-2 27
28
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Query Processing Queries are executed in RDF-stores and/or Hadoop Query execution is more efficient in RDF- stores than in Hadoop –Pushing as much of the processing as possible into RDF-stores –Minimizing the number of Hadoop jobs –The larger the hop guarantee, the more work is done in RDF-stores 2015-8-2 28
29
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn To Communicate, or not to Communicate Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed? –Choose the “center” of the query graph –Calculate the distance from the “center” to the furthest edge –If distance > n, communication is needed; not needed otherwise 2015-8-2 29
30
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Determining whether a Query is PWOC PWOC Query – parallelizable without communication DoFE – distance of farthest edge – the vertex in a graph with the smallest DoFE will be the most central in a graph 2015-8-2 30
31
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn The algorithm 2015-8-2 31
32
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn the issue of duplicate results naive approach –remove duplicates after the query has completed owner-computes model –add triples (v, ‘ ’, ‘Yes’) to a partition –For each query issued to the RDF-stores, add an additional pattern (core, ‘ ’, ‘Yes’) 2015-8-2 32
33
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn A query is not PWOC decompose the query into PWOC subqueries use Hadoop jobs to join the results of the PWOC subqueries The number of Hadoop jobs required to complete the query increases as the number of subqueries increases 2015-8-2 33
34
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn minimal number of subqueries reduces to the problem of finding minimal edge partitioning of a graph into subgraphs of bounded diameter brute-force 2015-8-2 34
35
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Examlple 2015-8-2 35 DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1 the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,
36
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Decompose Example 2015-8-2 36
37
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn EXPERIMENTS 20-machine cluster Leigh University Benchmark (LUBM): 270 million triples Competitors: –Single-node RDF-3X –SHARD: triple-store system in Hadoop –Graph partitioning (the proposed system) –Hash partitioning on subjects 2015-8-2 37
38
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Data Load Time 2015-8-2 38
39
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Performance Comparison 2015-8-2 39
40
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Varying Number of Machines 2015-8-2 40
41
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn Summary 2015-8-2 41
42
Graph Data Management Lab, School of Computer ScienceGDM@FUDANGDM@FUDANhttp://gdm.fudan.edu.cn 2015-8-2 42
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.