GRIN – A Graph Based RDF Index
- Indexing mechanism for graph-based queries.
- GRIN: a tree data structure.
- Large RDF datasets used: TAP, ChefMoz.
- Comparison with DB systems: Jena, Sesame, RDFBroker.
- Measured parameters: 1) size of the index; 2) time taken to answer graph queries; 3) time taken to build the index.
RDF Graph Example (extracted from the ChefMoz dataset)
RDF Graph Representation
An RDF triple has the form (s, p, v), where s ∈ U, p ∈ Up, and v ∈ R.
U denotes a set whose elements are called URI references.
L denotes a set whose elements are called literals.
Up ⊆ U denotes the set of properties.
R = U ∪ L denotes the set of resources.
Introduction to P-path
Given an RDF graph D and a set P ⊆ Up, a P-path in D is a set {e1, ..., eq}, with ej = (sj, pj, vj), such that:
∀ j ∈ [1, q]: ej ∈ D;
∀ j ∈ [1, q − 1]: vj = s_{j+1};
∀ j ∈ [1, q]: pj ∈ P.
Intuitively, a P-path is a path in the RDF graph whose edge labels are all drawn from the set P.
Example: let P = {location, locatedIn}. The triples (ColdStone, location, Lincoln) and (Lincoln, locatedIn, NE/USA) constitute a P-path of length two in the graph.
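
The three conditions in the P-path definition can be checked mechanically. The following is a minimal Python sketch, not part of the original slides. It assumes the path is given as an ordered list of triples (the definition speaks of a set, but the chaining condition implies an order), and the name `is_p_path` is purely illustrative.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


def is_p_path(path: List[Triple], graph: Set[Triple], props: Set[str]) -> bool:
    """Check the three P-path conditions for an ordered list of triples."""
    if not path:  # assumption: an empty sequence is not counted as a path
        return False
    for j, (s, p, v) in enumerate(path):
        if (s, p, v) not in graph:  # e_j must belong to D
            return False
        if p not in props:  # p_j must belong to P
            return False
        if j + 1 < len(path) and v != path[j + 1][0]:  # v_j must equal s_{j+1}
            return False
    return True


# The two triples from the slide's example.
graph = {
    ("ColdStone", "location", "Lincoln"),
    ("Lincoln", "locatedIn", "NE/USA"),
}
path = [("ColdStone", "location", "Lincoln"), ("Lincoln", "locatedIn", "NE/USA")]

print(is_p_path(path, graph, {"location", "locatedIn"}))  # True
print(is_p_path(path, graph, {"location"}))  # False: locatedIn is not in P
```
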
Introduction to P-path
P = {location, locatedIn}
d(ColdStone, NE/USA) = 2
Triples: (ColdStone, location, Lincoln) and (Lincoln, locatedIn, NE/USA)
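
The distance d used here can be computed with a breadth-first search. The sketch below is an illustration under one stated assumption: edge direction is ignored, so that d is symmetric. The slides do not spell this out. The helper names `build_adjacency` and `distance` are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Build an undirected adjacency map (assumption: direction is ignored)."""
    adj = defaultdict(set)
    for s, _p, v in triples:
        adj[s].add(v)
        adj[v].add(s)
    return adj


def distance(adj, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


triples = [
    ("ColdStone", "location", "Lincoln"),
    ("Lincoln", "locatedIn", "NE/USA"),
]
adj = build_adjacency(triples)
print(distance(adj, "ColdStone", "NE/USA"))  # 2
```
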
RDF Graph Query
An RDF graphical query is a 4-tuple (N, V, E, λn) where:
N is a set of vertices;
V is a set of variables;
E = Es ∪ Ed is a set of edges, where Es ⊆ N × N × (V ∪ Up) and Ed ⊆ N × N × 2^Up × ℕ. We call Es the set of single edges and Ed the set of double edges;
λn : N → R ∪ V is a vertex labeling function.
RDF Graph Query
The query can be expressed in SPARQL as:
SELECT ?v1 ?v2 ?v3
WHERE {{(?v1 attire ?v3). (?v1 cuisine Italian)}
       {(?v2 attire ?v3). (?v2 cuisine Italian). (?v2 location Norfolk)}
       {(Norfolk locatedIn NE/USA)}}
[Figure: query graph; one edge is labeled P-path]
GRIN Index
- Resources that are “closer” in the RDF graph are more likely to be part of the same answer, so they should appear on the same page.
- GRIN groups resources in circles around selected center resources.
- Query evaluation: find the smallest circle that contains the answer.
- Evaluate the query only on the resources in that circle.
Building a GRIN Index
A GRIN index is a balanced binary tree such that:
Each leaf node l contains a set Nl ⊆ R of nodes, such that for all leaf nodes l ≠ l', Nl ∩ Nl' = ∅;
Each non-leaf node t contains a pair (c, r), with c ∈ R and r ∈ ℕ. This is a very succinct representation of the set of resources in the graph at distance at most r from the resource c. We write this set as Nt = {c' ∈ R | d(c, c') ≤ r};
For any nodes x, y in the tree such that x is a parent of y, Nx ⊇ Ny.
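
To make the tree invariants concrete, here is a small Python sketch of a node type and an invariant check. It is an illustration only: the class `GrinNode`, the helpers `node_set` and `satisfies_invariants`, and the toy distance on four resources placed on a line are all assumptions introduced for this example, not structures taken from the GRIN paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class GrinNode:
    members: FrozenSet[str] = frozenset()  # N_l, used by leaf nodes only
    center: Optional[str] = None           # c, used by non-leaf nodes only
    radius: Optional[int] = None           # r, used by non-leaf nodes only
    left: Optional["GrinNode"] = None
    right: Optional["GrinNode"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def node_set(node, resources, dist):
    """Return N_l for a leaf, or N_t = {c' | d(c, c') <= r} for a non-leaf."""
    if node.is_leaf:
        return set(node.members)
    return {x for x in resources if dist(node.center, x) <= node.radius}


def satisfies_invariants(root, resources, dist) -> bool:
    """Check that leaf sets are disjoint and each parent set contains its children's."""
    seen_in_leaves = set()
    stack = [root]
    while stack:
        node = stack.pop()
        current = node_set(node, resources, dist)
        if node.is_leaf:
            if current & seen_in_leaves:
                return False
            seen_in_leaves |= current
            continue
        for child in (node.left, node.right):
            if child is None:
                continue
            if not node_set(child, resources, dist) <= current:
                return False
            stack.append(child)
    return True


# Hypothetical toy data: four resources on a line, distance = gap in position.
pos = {"a": 0, "b": 1, "c": 2, "d": 3}
dist = lambda x, y: abs(pos[x] - pos[y])
leaf1 = GrinNode(members=frozenset({"a", "b"}))
leaf2 = GrinNode(members=frozenset({"c", "d"}))
root = GrinNode(center="b", radius=2, left=leaf1, right=leaf2)

print(satisfies_invariants(root, pos.keys(), dist))  # True
```

Shrinking the root radius to 1 would leave "d" outside the root circle, so the containment check would fail.
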
Building a GRIN Index
M = maximum number of RDF graph vertices per page.
C = number of leaf nodes, chosen so that |R|/C ≤ M.
dc = inter-cluster distance function, defined from the vertex distance d in one of three ways:
(i) Single link: dc(S, S') = min d(x, y), over x ∈ S, y ∈ S'
(ii) Complete link: dc(S, S') = max d(x, y), over x ∈ S, y ∈ S'
(iii) Average link: dc(S, S') = (Σ d(x, y)) / (|S| × |S'|), summed over x ∈ S, y ∈ S'
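
The three inter-cluster distance functions differ only in how they aggregate the pairwise vertex distances. A short Python sketch follows. The function names and the toy distance (resources placed at hypothetical positions on a line) are assumptions made for illustration.

```python
from itertools import product


def single_link(S, T, d):
    """Smallest pairwise distance between the two clusters."""
    return min(d(x, y) for x, y in product(S, T))


def complete_link(S, T, d):
    """Largest pairwise distance between the two clusters."""
    return max(d(x, y) for x, y in product(S, T))


def average_link(S, T, d):
    """Mean of all |S| x |T| pairwise distances."""
    return sum(d(x, y) for x, y in product(S, T)) / (len(S) * len(T))


# Hypothetical toy data: positions on a line, distance = gap in position.
pos = {"a": 0, "b": 1, "c": 3, "d": 5}
d = lambda x, y: abs(pos[x] - pos[y])
S, T = {"a", "b"}, {"c", "d"}

print(single_link(S, T, d))    # 2
print(complete_link(S, T, d))  # 5
print(average_link(S, T, d))   # 3.5
```
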
Building a GRIN Index
Cluster the vertices into C disjoint sets using the PAM clustering algorithm, repeating until equilibrium is reached.
For each intermediate level, GRINBuild chooses a random node u, computes its closest node v, and assigns them a parent node (c, r), where c is selected from Nu ∪ Nv.
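
One level of the bottom-up step might be sketched as follows. This is a loose illustration, not the GRINBuild algorithm itself: the slide does not say how c is selected from Nu ∪ Nv or how r is set, so the sketch assumes that c is the member with the smallest covering radius and that r is that radius. It also assumes single-link closeness and promotes a leftover node unchanged. The names `make_parent` and `pair_level` are hypothetical, and the PAM clustering step is taken as already done.

```python
import random


def make_parent(members, d):
    """Pick (c, r) covering `members`. Assumption: minimize the covering radius."""
    center = min(sorted(members), key=lambda c: max(d(c, x) for x in members))
    radius = max(d(center, x) for x in members)
    return center, radius, members


def pair_level(clusters, d, rng):
    """Pair up node sets: random u, its closest v, then one parent per pair."""
    remaining = list(clusters)
    parents = []
    while len(remaining) >= 2:
        u = remaining.pop(rng.randrange(len(remaining)))  # random node u
        # closest remaining node v under single-link distance (assumption)
        v = min(remaining, key=lambda s: min(d(x, y) for x in u for y in s))
        remaining.remove(v)
        parents.append(make_parent(u | v, d))
    if remaining:  # odd count: promote the leftover node on its own
        parents.append(make_parent(remaining[0], d))
    return parents


# Hypothetical toy data: eight resources on a line, already split into leaves.
pos = {"a": 0, "b": 1, "c": 3, "d": 4, "e": 10, "f": 11, "g": 13, "h": 14}
d = lambda x, y: abs(pos[x] - pos[y])
clusters = [frozenset("ab"), frozenset("cd"), frozenset("ef"), frozenset("gh")]

parents = pair_level(clusters, d, random.Random(0))
for center, radius, members in parents:
    print(center, radius, sorted(members))
```

In this toy layout every choice of u leads to the same two pairs, so the result does not depend on the random seed; on real data it generally would.
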
Query Answering
Derive constraints from the query, then evaluate them against the nodes of the GRIN index:
d(?v1, NE/USA) ≤ 2, d(?v2, NE/USA) ≤ 2, d(?v2, Norfolk) ≤ 1, d(?v1, Norfolk) ≤ 3, d(?v1, Italian) ≤ 1, d(?v2, Italian) ≤ 1, d(?v3, NE/USA) ≤ 3, d(?v3, Norfolk) ≤ 2, d(?v3, Italian) ≤ 2.
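
One way to obtain such bounds is to run an all-pairs shortest-path computation over the query graph, giving each single edge weight 1 and each P-path edge its length bound. The Python sketch below does this with Floyd-Warshall. It is an assumption about how the constraints could be derived, not a transcription of the paper's procedure. In particular, the P-path edge between ?v1 and NE/USA with bound 2 is inferred from the constraint d(?v1, NE/USA) ≤ 2 listed on the slide, since the SPARQL text does not contain it.

```python
from itertools import product

INF = float("inf")


def derive_bounds(edges):
    """Upper bounds on d(x, y) for all query vertices, via Floyd-Warshall."""
    nodes = sorted({n for a, b, _w in edges for n in (a, b)})
    bound = {(a, b): (0 if a == b else INF) for a, b in product(nodes, repeat=2)}
    for a, b, w in edges:  # edges are treated as undirected
        bound[a, b] = min(bound[a, b], w)
        bound[b, a] = min(bound[b, a], w)
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in product(nodes, repeat=3):
        if bound[i, k] + bound[k, j] < bound[i, j]:
            bound[i, j] = bound[i, k] + bound[k, j]
    return bound


query_edges = [
    ("?v1", "?v3", 1),         # ?v1 attire ?v3
    ("?v1", "Italian", 1),     # ?v1 cuisine Italian
    ("?v2", "?v3", 1),         # ?v2 attire ?v3
    ("?v2", "Italian", 1),     # ?v2 cuisine Italian
    ("?v2", "Norfolk", 1),     # ?v2 location Norfolk
    ("Norfolk", "NE/USA", 1),  # Norfolk locatedIn NE/USA
    ("?v1", "NE/USA", 2),      # assumed P-path edge with length bound 2
]

bounds = derive_bounds(query_edges)
for var in ("?v1", "?v2", "?v3"):
    for const in ("NE/USA", "Norfolk", "Italian"):
        print(f"d({var}, {const}) <= {bounds[var, const]}")
```

Under these assumptions the computed bounds agree with the nine constraints listed on the slide.
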
Query Answering
For any given node, REJECT or ACCEPT it.
1: Reject circle(c, r) if any constant in the query is outside the circle.
2: Reject circle(c, r) if we cannot guarantee that every variable is inside the circle.
Is ?v1 in circle(Grivanti, 2)? By the triangle inequality, d(Grivanti, ?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 2, so ?v1 can be satisfied.
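
The two rejection rules can be sketched in Python as below. This is a simplified illustration: the function name `check_circle` and its argument layout are hypothetical, and the example covers only the constant Italian, with d(Grivanti, Italian) = 1 taken as the value implied by the slide's inequality chain. Distances from Grivanti to the other constants are not given on the slide and are left out.

```python
INF = float("inf")


def check_circle(center_dist, radius, variables, bounds):
    """Apply the two rules to circle(c, r).

    center_dist maps each query constant k to d(c, k) in the data graph.
    bounds maps (variable, constant) to the derived upper bound on their distance.
    """
    # Rule 1: every constant must lie inside the circle.
    if any(dk > radius for dk in center_dist.values()):
        return "REJECT"
    # Rule 2: the triangle inequality must place every variable inside the circle.
    for var in variables:
        upper = min(
            (center_dist[k] + bounds[var, k] for k in center_dist if (var, k) in bounds),
            default=INF,
        )
        if upper > radius:
            return "REJECT"
    return "ACCEPT"


center_dist = {"Italian": 1}       # d(Grivanti, Italian), inferred from the slide
bounds = {("?v1", "Italian"): 1}   # d(?v1, Italian) <= 1

print(check_circle(center_dist, 2, ["?v1"], bounds))  # ACCEPT
print(check_circle(center_dist, 1, ["?v1"], bounds))  # REJECT
```

With radius 1 the constant Italian is still inside the circle, but the best guarantee for ?v1 is distance 2, so rule 2 rejects the circle.
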
RDF System : GRIN
Does not store the data in the index, but points to it.
The data is stored in a hash table.
Only one computationally intensive operation: clustering the leaf nodes.
For 300 MB of data, the index is stored in 75 MB and the hash table uses 320 MB.
RDF System : Jena
Stores RDF as (subject, property, value) in a relational table.
Indexes each of the three attributes.
Translates SPARQL/RDQL into SQL.
Requires too many self-joins.
Used 403 MB for indexing on 300 MB of data.
RDF System : Sesame
Supports RDF Schema inference.
Separates RDFS from the triple table.
Supports database schema generation based on the underlying RDF schema of a dataset.
The problem of too many joins still remains.
Used 825 MB for indexing on 300 MB of data.
RDF System : RDFBroker
The database schema is built from signatures, where a signature is the set of properties used on a resource.
Reduces the number of joins between tables.
Used 950 MB for indexing on 300 MB of data.
Discussion
Vertices in GRIN = resources in the underlying RDF graph.
There can be at most |R| resources, so the number of leaf nodes is O(|R|).
GRIN is a binary tree, so the height of the tree is O(log2 |R|).
Worst-case complexity for index building: O(|R|^4 · log2 |R|).
Good for small datasets only.
Discussion
Time complexity for query answering: best case O(N), worst case O(N!), where N is the total number of vertices in the graphs to be matched.
“Our experimental results show that GRINAnswer is often faster than Jena, Sesame and RDFBroker for certain types of graph-based queries.”
Discussion
The query can be expressed in SPARQL as:
SELECT ?v1 ?v2 ?v3
WHERE {{(?v1 attire ?v3). (?v1 cuisine Italian)}
       {(?v2 attire ?v3). (?v2 cuisine Italian). (?v2 location Norfolk)}
       {(Norfolk locatedIn NE/USA)}}
No way to represent the P-path in SPARQL!