Triangle Finding: How Graph Theory can Help the Semantic Web
Edward Jimenez, Eric Goodman
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL SAND P
The Semantic Web as a Graph
Optimizing Queries with Graph Theory
Graph theory has a lot to offer the semantic web.
One example: triangle finding in O(|E|^1.5) time, much more efficient than what a typical database would do.

Query2
    SELECT ?X, ?Y, ?Z
    WHERE {
      ?X rdf:type ub:GraduateStudent.
      ?Y rdf:type ub:University.
      ?Z rdf:type ub:Department.
      ?X ub:memberOf ?Z.
      ?Z ub:subOrganizationOf ?Y.
      ?X ub:undergraduateDegreeFrom ?Y}

Query9
    SELECT ?X, ?Y, ?Z
    WHERE {
      ?X rdf:type ub:Student.
      ?Y rdf:type ub:Faculty.
      ?Z rdf:type ub:Course.
      ?X ub:advisor ?Y.
      ?Y ub:teacherOf ?Z.
      ?X ub:takesCourse ?Z}
Experiment
Compare three approaches to finding all triangles in a graph:
    Sesame
    Jena
    MultiThreaded Graph Library (MTGL)
MTGL
    Open-source library of graph algorithms, targeted at shared-memory supercomputers.
    Used MTGL’s implementation of J. Cohen’s triangle-finding algorithm.
    Had to modify it slightly to allow for multiple edges between vertices.
Data
Data: a Recursive Matrix (R-MAT) graph
    Specify |V|, the edge factor (average number of edges per vertex), and the probabilities a, b, c, d, where a+b+c+d=1.
    Has properties similar to real-world graphs, such as short diameters and small-world properties.
    Used as the basis of the Graph500 benchmark.
    Nodes are given a unique IRI and edges are given a random value.
|V| = { }
Edge factor: {16, 32, 64}
[Figure: adjacency matrix quadrants labeled a, b, c, d]
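The R-MAT construction described above can be sketched in a few lines. This is a minimal illustration, not the generator used in the experiment: it assumes |V| is a power of two (`2**scale`), and the function name `rmat_edges`, the `seed` parameter, and the probabilities in the usage line (the values commonly quoted for Graph500) are assumptions made for the example. Each edge is placed by repeatedly choosing one quadrant of the adjacency matrix with probabilities a, b, c, d.

```python
import random

def rmat_edges(scale, edge_factor, a, b, c, d, seed=0):
    """Return edge_factor * 2**scale directed edges of an R-MAT graph.

    Assumption for this sketch: |V| = 2**scale, vertices are 0..|V|-1.
    """
    assert abs(a + b + c + d - 1.0) < 1e-9, "probabilities must sum to 1"
    rng = random.Random(seed)
    n_vertices = 2 ** scale
    edges = []
    for _ in range(edge_factor * n_vertices):
        src = dst = 0
        # Descend one level of the recursive matrix per bit of the vertex id.
        for _ in range(scale):
            src *= 2
            dst *= 2
            r = rng.random()
            if r < a:
                pass                # top-left quadrant
            elif r < a + b:
                dst += 1            # top-right quadrant
            elif r < a + b + c:
                src += 1            # bottom-left quadrant
            else:
                src += 1            # bottom-right quadrant
                dst += 1
        edges.append((src, dst))
    return edges

# Illustrative parameters only; the source does not state the values used.
sample = rmat_edges(scale=4, edge_factor=16, a=0.57, b=0.19, c=0.19, d=0.05)
```

Because edges are drawn independently, the same vertex pair can be drawn more than once, which is one way multiple edges between vertices can arise.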
Possible Triangles
Trying to Find Triangles via SPARQL
    SELECT ?X ?Y ?Z
    WHERE {
      {?X ?a ?Y. ?Y ?b ?Z. ?Z ?c ?X}
      UNION {?Y ?a ?X. ?Z ?b ?Y. ?X ?c ?Z}
      UNION {?X ?a ?Y. ?Y ?b ?Z. ?X ?c ?Z}
      UNION {?X ?a ?Y. ?Z ?b ?Y. ?X ?c ?Z}
      UNION {?Y ?a ?X. ?Y ?b ?Z. ?X ?c ?Z}
      UNION {?Y ?a ?X. ?Z ?b ?Y. ?Z ?c ?X}
      UNION {?X ?a ?Y. ?Z ?b ?Y. ?Z ?c ?X}
      UNION {?Y ?a ?X. ?Y ?b ?Z. ?Z ?c ?X}}
Redundant Solutions
The Problem: Graph Isomorphism
[Figure: triangle patterns iii and iv over ?X, ?Y, ?Z, matched against the nodes Alice, Bob, Charlie]
Bindings shown:
    ?X = Alice, ?Y = Bob, ?Z = Charlie
    ?X = Alice, ?Y = Charlie, ?Z = Bob
The Other Problem: Automorphism
[Figure: triangle pattern i over ?X, ?Y, ?Z, matched against the nodes Alice, Bob, Charlie]
Bindings shown:
    ?X = Alice, ?Y = Bob, ?Z = Charlie
    ?X = Charlie, ?Y = Alice, ?Z = Bob
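The automorphism problem can be reproduced outside a triple store. The sketch below is a brute-force stand-in for the cyclic pattern `?X ?a ?Y. ?Y ?b ?Z. ?Z ?c ?X`, not how Sesame or Jena evaluate it. The three-edge graph and the helper name `match_cycle` are assumptions made for the example.

```python
from itertools import permutations

# A single directed cycle: Alice -> Bob -> Charlie -> Alice.
edges = {("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Alice")}
nodes = {"Alice", "Bob", "Charlie"}

def match_cycle(edges, nodes):
    """Enumerate every binding (x, y, z) with edges x->y, y->z, z->x."""
    return [
        (x, y, z)
        for x, y, z in permutations(sorted(nodes), 3)
        if (x, y) in edges and (y, z) in edges and (z, x) in edges
    ]

solutions = match_cycle(edges, nodes)
for x, y, z in solutions:
    print(f"?X = {x}  ?Y = {y}  ?Z = {z}")
```

The one triangle is reported once per rotation of the cycle, so the result set contains three bindings that all describe the same subgraph.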
Possible Triangles
The SPARQL Query
    SELECT ?X ?Y ?Z
    WHERE {
      { ?X ?a ?Y. ?Y ?b ?Z. ?Z ?c ?X
        FILTER (STR(?X) < STR(?Y))
        FILTER (STR(?Y) < STR(?Z)) }
      UNION
      { ?X ?a ?Y. ?Y ?b ?Z. ?Z ?c ?X
        FILTER (STR(?Y) > STR(?Z))
        FILTER (STR(?Z) > STR(?X)) }
      UNION
      { ?X ?a ?Y. ?Y ?b ?Z. ?X ?c ?Z }}
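To see why two filtered cyclic branches plus one unfiltered branch are enough, the query can be emulated by brute force. This is a sketch under stated assumptions: the graph is a set of directed node pairs with no self-loops and no parallel edges, plain string comparison stands in for `STR(...)`, and the names `filtered_query` and `demo_edges` are hypothetical.

```python
from itertools import permutations

def filtered_query(edges):
    """Brute-force emulation of the three UNION branches of the query."""
    nodes = sorted({v for edge in edges for v in edge})
    results = []
    for x, y, z in permutations(nodes, 3):
        is_cycle = (x, y) in edges and (y, z) in edges and (z, x) in edges
        is_transitive = (x, y) in edges and (y, z) in edges and (x, z) in edges
        # Branch 1: cyclic pattern, keep only the rotation with X < Y < Z.
        if is_cycle and x < y < z:
            results.append((x, y, z))
        # Branch 2: cyclic pattern, keep only the rotation with Y > Z > X.
        if is_cycle and y > z > x:
            results.append((x, y, z))
        # Branch 3: pattern with no automorphisms, so no filter is needed.
        if is_transitive:
            results.append((x, y, z))
    return results

demo_edges = {
    ("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Alice"),  # cycle
    ("Dave", "Frank"), ("Frank", "Erin"), ("Erin", "Dave"),      # reversed cycle
    ("Gina", "Hal"), ("Hal", "Ivy"), ("Gina", "Ivy"),            # transitive
}
found = filtered_query(demo_edges)
```

The three rotations of a directed cycle share one cyclic ordering of their labels, so exactly one rotation satisfies either `X < Y < Z` or `X < Z < Y`. Each cycle therefore survives in exactly one of the first two branches, and `found` holds one binding per triangle.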
Cohen’s Triangle Algorithm
Assumptions
    Simplified graph
    Completely connected
Map 1: O(m)
    Use v_1 < v_2 < ··· < v_n for tie-breaking
Cohen’s Triangle Algorithm
Reduce: O(m^(3/2)), …
Cohen’s Triangle Algorithm
Map 2: O(m^(3/2))
    Identity mapping of previous reduce step.
    Map edges.
    [Figure: example bin with entries (v8, v20, v1), (v8, v20, v3), (v8, v20, v2), …, (v8, v20)]
Reduce 2: O(m^(3/2))
    Emit triangles for the contents of each bin when the edge exists between v_i and v_j.
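The map and reduce phases above can be imitated in a single process. The following is a minimal sketch of the degree-ordered binning idea, assuming a simple undirected graph with comparable vertex ids. It is not MTGL's implementation: it collapses multiple edges rather than handling them as the modified version did, and the name `cohen_triangles` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cohen_triangles(edge_list):
    """Single-process sketch of the two map/reduce rounds."""
    # Simplify: undirected, no self-loops, no parallel edges.
    edges = {tuple(sorted(e)) for e in edge_list if e[0] != e[1]}

    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(v):
        # Order by degree, breaking ties with v_1 < v_2 < ... < v_n.
        return (degree[v], v)

    # Map 1: bin each edge under its lower-ranked endpoint.
    bins = defaultdict(list)
    for u, v in edges:
        low, high = (u, v) if rank(u) < rank(v) else (v, u)
        bins[low].append(high)

    # Reduce 1: every pair of edges in a bin is an open triad,
    # keyed by the pair (v_i, v_j) that would close it.
    triads = defaultdict(list)
    for apex, neighbours in bins.items():
        for vi, vj in combinations(sorted(neighbours), 2):
            triads[(vi, vj)].append(apex)

    # Map 2 / Reduce 2: emit a triangle when the edge v_i - v_j exists.
    triangles = []
    for (vi, vj), apexes in triads.items():
        if (vi, vj) in edges:
            for apex in apexes:
                triangles.append(tuple(sorted((apex, vi, vj))))
    return sorted(triangles)

# A 4-clique on vertices 0..3 plus one pendant edge.
k4_plus_tail = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(cohen_triangles(k4_plus_tail))
```

Binning each edge under its lower-ranked endpoint means a triangle is only ever assembled at its lowest-ranked vertex, so it is emitted once, and high-degree vertices do not accumulate large bins.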
Results: Growth of Triangles
Results
Comparison at Larger Scales
With 1 billion edges, assuming the same constant factor:
    An O(x^1.39) implementation is about 50x faster than an O(x^1.58) implementation.
    An O(x^1.39) implementation is about 9000x faster than an O(x^1.83) implementation.
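The speedup figures follow directly from the exponents: with equal constant factors, the ratio of running times is x^(slow exponent - fast exponent). A short check of that arithmetic, where the helper name `speedup` is an assumption for the example:

```python
def speedup(n_edges, slow_exponent, fast_exponent):
    """Ratio of running times x**slow / x**fast, assuming equal constants."""
    return n_edges ** (slow_exponent - fast_exponent)

x = 1e9  # one billion edges
ratio_158 = speedup(x, 1.58, 1.39)  # 10**(9 * 0.19)
ratio_183 = speedup(x, 1.83, 1.39)  # 10**(9 * 0.44)
print(f"vs O(x^1.58): about {ratio_158:.0f}x")
print(f"vs O(x^1.83): about {ratio_183:.0f}x")
```

The two ratios come out slightly above 50 and slightly above 9000, consistent with the rounded figures on the slide.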
Conclusions
    The Semantic Web is a graph.
    Graph theory can add a lot in terms of speeding up queries.
    It also has other approaches for analyzing the data.
    SPARQL has unexpected issues when graph isomorphism or automorphisms arise.