GRIN: A Graph Based RDF Index
Octavian Udrea (1), Andrea Pugliese (2), V. S. Subrahmanian (1)
(1) University of Maryland, College Park
(2) Università di Calabria
Motivation
Plenty of large RDF datasets: TAP, GovTrack, ChefMoz, CIA World Factbook
  Many more (see rdfdata.org)
Query languages: RDQL, RQL, SPARQL
DB systems: Jena, Sesame, RDFBroker
Indexing?
  Currently based on relational database indexes
  Has to be rooted in the characteristics of the query language
Contributions
Lightweight mechanism for indexing large RDF datasets
  GRIN: Graph-based RDF INdex
Query answering algorithms for SPARQL-like queries
Evaluation on two real-world datasets: TAP (Stanford) and ChefMoz (chefmoz.org)
Outline
RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
RDF graph example (ChefMoz)
RDF query example
Query example in SPARQL
SELECT ?v1 ?v2 ?v3
WHERE {
  {(?v1 attire ?v3). (?v1 cuisine Italian)}
  {(?v2 attire ?v3). (?v2 cuisine Italian). (?v2 location Norfolk)}
  {(Norfolk locatedIn NE/USA)}
}
FROM ChefMoz
Native RDF systems: Jena2
Stores RDF as (subject, property, value) triples in a relational table
Indexes on each of the three attributes
Translates SPARQL/RDQL into SQL
  The example query requires 6 self-joins of the triple table
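To make the self-join cost concrete, here is a minimal sketch of a triple table and a hand-written SQL translation of the example query. It is not Jena2's actual schema or generated SQL; the table name `triples`, the index names, and the resources `Luigis` and `Casual` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, value TEXT)")
# One index per attribute, as the slide describes.
for col in ("subject", "property", "value"):
    conn.execute(f"CREATE INDEX idx_{col} ON triples ({col})")

# Toy data (hypothetical, loosely modelled on the ChefMoz example).
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Grivanti", "cuisine", "Italian"),
        ("Grivanti", "attire", "Casual"),
        ("Luigis", "cuisine", "Italian"),
        ("Luigis", "attire", "Casual"),
        ("Luigis", "location", "Norfolk"),
        ("Norfolk", "locatedIn", "NE/USA"),
    ],
)

# Each of the six triple patterns becomes one alias of the same table;
# shared variables become join conditions between aliases.
sql = """
SELECT t1.subject AS v1, t3.subject AS v2, t1.value AS v3
FROM triples t1, triples t2, triples t3, triples t4, triples t5, triples t6
WHERE t1.property = 'attire'
  AND t2.property = 'cuisine'  AND t2.value = 'Italian' AND t2.subject = t1.subject
  AND t3.property = 'attire'   AND t3.value = t1.value
  AND t4.property = 'cuisine'  AND t4.value = 'Italian' AND t4.subject = t3.subject
  AND t5.property = 'location' AND t5.value = 'Norfolk' AND t5.subject = t3.subject
  AND t6.subject = 'Norfolk'   AND t6.property = 'locatedIn' AND t6.value = 'NE/USA'
"""
rows = sorted(conn.execute(sql).fetchall())
```

Every additional triple pattern in a query adds another alias of the same table, which is why join cost grows with query size under this storage scheme.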
Native RDF systems: Sesame
Broekstra et al., ISWC 2002
The Sesame SAIL API improves on Jena:
  Supports RDF Schema inference
  Separates RDFS from the triple table
  Supports database schema generation based on the underlying RDF schema of a dataset
The problem of too many joins remains
Native RDF systems: RDFBroker
Sintek et al., ESWC 2006
The database schema is built from signatures: the set of properties used on a resource
Reduces the number of joins between tables
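As a rough illustration of the signature idea only, the sketch below groups resources by the set of properties they use. RDFBroker's real schema construction is more involved; the function name `signatures` and the toy triples (including `Luigis` and `Casual`) are hypothetical.

```python
from collections import defaultdict

# Toy triples (hypothetical data).
TRIPLES = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "Casual"),
    ("Luigis", "cuisine", "Italian"),
    ("Luigis", "attire", "Casual"),
    ("Luigis", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def signatures(triples):
    """Map each signature (set of properties) to the resources that use it."""
    props = defaultdict(set)
    for subject, prop, _ in triples:
        props[subject].add(prop)
    groups = defaultdict(list)
    for subject, used in props.items():
        groups[frozenset(used)].append(subject)
    return {sig: sorted(resources) for sig, resources in groups.items()}

groups = signatures(TRIPLES)
```

Resources that share a signature could then be stored in one table with a column per property, so a query touching several properties of the same resource needs fewer joins.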
The human perspective
GRIN intuition
Resources “closer” in the RDF graph are more likely to be part of the same answer
  Hence they should appear on the same page
GRIN groups resources in circles around selected center resources
Query evaluation:
  Find the smallest circle that contains the answer
  Evaluate the query only on resources in that circle
The GRIN Index structure
GRIN is a binary tree in which:
  Leaf nodes are sets of resources (and the associated triples)
  Inner nodes are circles consisting of a center resource and a radius
  Each node is fully contained in its parent
Distance metric: shortest-path distance in the undirected graph
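The structure described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the class names `Leaf` and `Circle`, the helper names, and the toy triples are all assumptions. Distances are computed by breadth-first search over the undirected graph.

```python
from collections import deque
from dataclasses import dataclass
from typing import Union

def build_adjacency(triples):
    """Undirected adjacency over subjects and values; properties are ignored."""
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    return adj

def distances_from(adj, source):
    """Shortest-path (hop count) distances from source, via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

@dataclass
class Leaf:
    resources: frozenset  # resources stored on one page

@dataclass
class Circle:
    center: str
    radius: int
    left: Union["Circle", Leaf]
    right: Union["Circle", Leaf]

def covered(node, adj):
    """Resources covered by a node: a leaf's set, or everything within the radius."""
    if isinstance(node, Leaf):
        return set(node.resources)
    dist = distances_from(adj, node.center)
    return {r for r, d in dist.items() if d <= node.radius}

# Toy data (hypothetical).
TRIPLES = [
    ("Grivanti", "cuisine", "Italian"), ("Grivanti", "attire", "Casual"),
    ("Luigis", "cuisine", "Italian"), ("Luigis", "attire", "Casual"),
    ("Luigis", "location", "Norfolk"), ("Norfolk", "locatedIn", "NE/USA"),
]
adj = build_adjacency(TRIPLES)
root = Circle(
    "Luigis", 2,
    left=Leaf(frozenset({"Grivanti", "Italian", "Casual"})),
    right=Leaf(frozenset({"Luigis", "Norfolk", "NE/USA"})),
)
```

In this toy tree both leaves are subsets of the root circle, which matches the containment property stated on the slide.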
Building the index: clustering
Standard k-medoids clustering (Kaufman & Rousseeuw, 1987)
How many clusters?
  R is the set of resources
  M is the maximum number of resources per page
Average link gives the best performance for the inter-cluster distance
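For readers unfamiliar with k-medoids, the sketch below shows a simple alternating variant (assign points to the nearest medoid, then re-pick each cluster's medoid). It is not the PAM swap procedure of Kaufman & Rousseeuw and not necessarily what GRIN uses; the function name, the fixed starting medoids, and the one-dimensional toy distances are assumptions. In GRIN the distance table would hold shortest-path distances between resources.

```python
def k_medoids(points, dist, medoids, max_iter=20):
    """Alternating k-medoids over a precomputed distance table dist[p][q]."""
    medoids = list(medoids)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist[m][p])
            clusters[nearest].append(p)
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][q] for q in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged: clusters are keyed by the final medoids
        medoids = new_medoids
    return clusters

# Toy example: six points on a line, two obvious groups.
POSITIONS = {"a": 0, "b": 1, "c": 2, "d": 10, "e": 11, "f": 12}
DIST = {p: {q: abs(POSITIONS[p] - POSITIONS[q]) for q in POSITIONS} for p in POSITIONS}
clusters = k_medoids(list(POSITIONS), DIST, ["a", "b"])
```

Because a medoid is always an actual data point, each cluster naturally yields a center resource for a circle in the index.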
Building the index: the tree
Queries to constraints
Extract constraints from the query:
  d(?v1, Italian) ≤ 1
  d(?v2, Norfolk) ≤ 1
  d(?v3, Italian) ≤ 2
  …and so on
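One plausible way to obtain such constraints, sketched here under the assumption that the query is treated as an undirected graph: the hop count between a variable and a constant in the query graph bounds their distance in the data graph for any answer. The function name `extract_constraints` is hypothetical; the patterns are those of the SPARQL example.

```python
from collections import deque

PATTERNS = [
    ("?v1", "attire", "?v3"),
    ("?v1", "cuisine", "Italian"),
    ("?v2", "attire", "?v3"),
    ("?v2", "cuisine", "Italian"),
    ("?v2", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def extract_constraints(patterns):
    """Return {(variable, constant): k}, meaning d(variable, constant) <= k."""
    adj = {}
    for s, _, o in patterns:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    constraints = {}
    for var in sorted(n for n in adj if n.startswith("?")):
        # BFS over the query graph starting at the variable.
        dist = {var: 0}
        queue = deque([var])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if not node.startswith("?"):
                constraints[(var, node)] = d
    return constraints

constraints = extract_constraints(PATTERNS)
```

On the example query this reproduces the three bounds listed on the slide, along with further ones such as a bound between ?v2 and NE/USA.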
Query evaluation
Goal: identify the smallest circle that is guaranteed to contain an answer to the query
1. Perform a depth-first traversal
2. For each index node, evaluate the constraints
3. If the constraints guarantee an answer, perform subgraph matching
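The first two steps can be sketched as a recursive descent that keeps going while some child circle still provably contains every query variable. This is a simplified reading, not the authors' algorithm: every node is modelled as a circle, and the `Node` class, helper names, and toy data (including `Luigis`, `Casual`, `Sakura`, `Japanese`, `Omaha`) are assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    center: str
    radius: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def bfs(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def guaranteed_inside(node, adj, constraints, variables):
    """True if, for every variable, some constraint places it within the circle."""
    from_center = bfs(adj, node.center)
    return all(
        any(
            const in from_center and from_center[const] + bound <= node.radius
            for (v, const), bound in constraints.items()
            if v == var
        )
        for var in variables
    )

def smallest_circle(node, adj, constraints, variables):
    """Depth-first search for the deepest node that still contains every variable."""
    if node is None or not guaranteed_inside(node, adj, constraints, variables):
        return None
    for child in (node.left, node.right):
        found = smallest_circle(child, adj, constraints, variables)
        if found is not None:
            return found
    return node

# Toy data graph (hypothetical).
TRIPLES = [
    ("Grivanti", "cuisine", "Italian"), ("Grivanti", "attire", "Casual"),
    ("Luigis", "cuisine", "Italian"), ("Luigis", "attire", "Casual"),
    ("Luigis", "location", "Norfolk"), ("Norfolk", "locatedIn", "NE/USA"),
    ("Sakura", "cuisine", "Japanese"), ("Sakura", "location", "Omaha"),
    ("Omaha", "locatedIn", "NE/USA"),
]
adj = {}
for s, _, o in TRIPLES:
    adj.setdefault(s, set()).add(o)
    adj.setdefault(o, set()).add(s)

CONSTRAINTS = {("?v1", "Italian"): 1, ("?v2", "Norfolk"): 1, ("?v3", "Italian"): 2}
index = Node("NE/USA", 5, left=Node("Luigis", 3), right=Node("Sakura", 2))
found = smallest_circle(index, adj, CONSTRAINTS, ["?v1", "?v2", "?v3"])
```

Subgraph matching (step 3) would then run only on the resources inside the circle that was found.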
Evaluating constraints
Constraints: d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1, d(?v3, Italian) ≤ 2
Question: is ?v1 in the circle (Grivanti, 3)?
  d(Grivanti, ?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 1 + 1 = 2
  Since 2 ≤ 3, ?v1 must be in the circle (Grivanti, 3)
Evaluating constraints
Question: is ?v3 in (Grivanti, 3)?
  d(Grivanti, ?v3) ≤ d(Grivanti, Italian) + d(Italian, ?v3) ≤ 1 + 2 = 3
  Since 3 ≤ 3, ?v3 must be in (Grivanti, 3)
Similarly, ?v2 is in the same circle
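Both checks apply the same triangle-inequality argument. Stated generally, in notation that is not on the slides (c is the circle's center, r its radius, a a constant appearing in the query, and k the bound from a constraint d(?v, a) ≤ k):

```latex
\[
d(c, ?v) \;\le\; d(c, a) + d(a, ?v) \;\le\; d(c, a) + k
\]
\[
d(c, a) + k \;\le\; r \quad\Longrightarrow\quad ?v \in \mathrm{circle}(c, r)
\]
```

The test is sufficient rather than necessary: if d(c, a) + k exceeds r, the variable may still lie inside the circle, but the constraint alone cannot guarantee it.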
Subgraph matching
Perform subgraph matching on the resources in the circles guaranteed to contain an answer
Algorithm by Cordella et al., IEEE TPAMI 26(10), 2004
Worst-case time complexity of O(N!)
  where N is the maximum number of nodes in either graph
In practice, GRIN makes N very small
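To show what matching a query against a small set of triples involves, here is a naive backtracking matcher. It is not the Cordella et al. algorithm, and it allows two variables to bind to the same resource; the function names and the toy triples (including `Luigis` and `Casual`) are assumptions.

```python
def unify(term, value, binding):
    """Match one query term against one data value, extending the binding."""
    if not term.startswith("?"):
        return term == value
    if term in binding:
        return binding[term] == value
    binding[term] = value
    return True

def match(patterns, triples):
    """Yield every variable binding that maps all patterns onto data triples."""
    data = list(triples)

    def extend(i, binding):
        if i == len(patterns):
            yield dict(binding)
            return
        for triple in data:
            candidate = dict(binding)  # copy so a failed attempt leaves no trace
            if all(unify(p, t, candidate) for p, t in zip(patterns[i], triple)):
                yield from extend(i + 1, candidate)

    return extend(0, {})

PATTERNS = [
    ("?v1", "attire", "?v3"),
    ("?v1", "cuisine", "Italian"),
    ("?v2", "attire", "?v3"),
    ("?v2", "cuisine", "Italian"),
    ("?v2", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]
# Toy data (hypothetical), standing in for the triples inside one circle.
TRIPLES = [
    ("Grivanti", "cuisine", "Italian"), ("Grivanti", "attire", "Casual"),
    ("Luigis", "cuisine", "Italian"), ("Luigis", "attire", "Casual"),
    ("Luigis", "location", "Norfolk"), ("Norfolk", "locatedIn", "NE/USA"),
]
answers = sorted((b["?v1"], b["?v2"], b["?v3"]) for b in match(PATTERNS, TRIPLES))
```

The search space grows quickly with the number of candidate triples, which is why restricting matching to a small circle matters.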
Experimental framework
Comparison between GRIN, Sesame, Jena2 and RDFBroker (in-memory)
  Index build time
  Memory consumption at query time
  Query time
Two real-world datasets:
  TAP (Stanford): datasets between 1.5 MB and 300 MB
  ChefMoz (chefmoz.org): 220 MB
Index build time
Memory consumption
Query time
Average degree of a query node
Conclusions
Method for indexing large RDF graphs adapted to the characteristics of RDF queries
  Avoids expensive join operations
  Gives better query times than Jena2, Sesame and RDFBroker
Current and future work:
  Disk-based index
  Analysis of overlap and coverage