DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases


1 DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases
Matthias Bröcheler, Andrea Pugliese, and V.S. Subrahmanian
University of Maryland, USA; Università della Calabria, Italy
Yuval Gil

2 Outline: Introduction; DOGMA index; Basic algorithm for graph queries; Advanced algorithm for graph queries; DOGMA_ipd and DOGMA_epd; Experimental results; Future work; Conclusion

3 Introduction RDF is becoming an increasingly important paradigm for Web knowledge representation. We need to store this data and efficiently query massive RDF datasets. More and more RDF datasets are represented by graphs.

4 Introduction

5 Formal definitions An RDF triple has the form (subject, property, value).
An RDF database R is a finite set of RDF triples.

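6 Formal definitions (2) The RDF graph of R is $G_R = (V_R, E_R, \lambda_R)$, where $V_R$ is the set of subjects and values occurring in R, $E_R = \{(s, v) \mid \exists p : (s, p, v) \in R\}$, and $\lambda_R$ is a mapping from edges to sets of properties s.t. for all $(s, v) \in E_R$, $\lambda_R(s, v) = \{p \mid (s, p, v) \in R\}$.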
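As a rough illustration of how an RDF database turns into a labeled graph, here is a minimal Python sketch. It assumes each edge (s, v) carries the set of properties p for which (s, p, v) is in R. The function name `build_rdf_graph` and the toy triples are hypothetical and not taken from the paper or its datasets.

```python
from collections import defaultdict

def build_rdf_graph(triples):
    """Build (vertices, edges, labels) from (subject, property, value) triples."""
    vertices = set()
    labels = defaultdict(set)  # edge (s, v) -> set of properties on that edge
    for s, p, v in triples:
        vertices.update((s, v))
        labels[(s, v)].add(p)
    return vertices, set(labels), dict(labels)

# Toy database; the triples are illustrative only.
R = {
    ("Peter Traves", "gender", "Male"),
    ("Bill B0054", "sponsor", "Peter Traves"),
    ("Bill B0054", "subject", "Health Care"),
}
V_R, E_R, lam_R = build_rdf_graph(R)
```

Subjects and values end up in one vertex set, and each triple contributes one directed, labeled edge from subject to value.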

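7 Graph Queries We assume the existence of a set VAR of variable symbols, each starting with "?". A graph query is any graph $Q = (V_Q, E_Q, \lambda_Q)$ where $V_Q \subseteq \mathrm{VAR} \cup S \cup V$ (variables, subjects, and values), $E_Q \subseteq V_Q \times V_Q$, and $\lambda_Q$ labels each edge with a property.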

8 RDF Example Query: a bill whose subject is health care, whose sponsor is male, and which has an amendment by Carla Bunes.

9 Substitution A substitution for query Q is a mapping $\theta : V_Q \cap \mathrm{VAR} \to S \cup V$.
A substitution maps every variable vertex in query Q to either a subject or a value. If θ is a substitution for query Q, then Qθ denotes the result of replacing every variable ?v in Q by θ(?v).

10 Answer A substitution θ is an answer for query Q
with respect to database R iff Qθ is a subgraph of $G_R$. The answer set for query Q w.r.t. an RDF database R is the set $\{\theta \mid Q\theta \text{ is a subgraph of } G_R\}$.

11 Example
Query: ?v1 --Gender--> Male
Substitution: Bill B0054 --Gender--> Male
Answer: Peter Traves --Gender--> Male
Answer set = {Peter Traves, Jeff Ryser, Pierce Dickes, Keith Farmer}
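The difference between a substitution and an answer can be sketched in a few lines of Python. The sketch assumes graphs are stored as sets of labeled edges (source, label, target) and that variables are strings starting with "?". The helper names and the toy database are hypothetical.

```python
def apply_substitution(query_edges, theta):
    """Replace every variable in the query by its image under theta."""
    return {(theta.get(u, u), label, theta.get(v, v))
            for u, label, v in query_edges}

def is_answer(query_edges, theta, database):
    """theta is an answer iff the substituted query is a subgraph of the database."""
    return apply_substitution(query_edges, theta) <= database

# Toy database (illustrative only).
database = {
    ("Peter Traves", "gender", "Male"),
    ("Bill B0054", "sponsor", "Peter Traves"),
}
query = {("?v1", "gender", "Male")}

not_an_answer = is_answer(query, {"?v1": "Bill B0054"}, database)
an_answer = is_answer(query, {"?v1": "Peter Traves"}, database)
```

Both mappings are valid substitutions, but only the second one produces an edge that exists in the toy database.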

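12 K-merge Graph Suppose $G = (V, E, \lambda)$ is an RDF graph and $G_1 = (V_1, E_1, \lambda_1)$, $G_2 = (V_2, E_2, \lambda_2)$ are two RDF graphs
s.t. $V_1, V_2 \subseteq V$ and k is an integer s.t. $k \le \max(|V_1|, |V_2|)$. Then graph $G_m = (V_m, E_m, \lambda_m)$ is a k-merge of graphs $G_1, G_2$ iff: 1) $|V_m| = k$; 2) there is a surjective mapping $\mu : V_1 \cup V_2 \to V_m$, called the merge mapping, such that for all $v \in V_m$, $rep(v) = \{v' \in V_1 \cup V_2 \mid \mu(v') = v\}$, and $(v_1, v_2) \in E_m$ iff there exist $v_1' \in rep(v_1)$, $v_2' \in rep(v_2)$ with $(v_1', v_2') \in E$.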
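One way to read the merge-mapping condition is that two merged vertices are adjacent exactly when some pair of the original vertices they represent is adjacent in G. The sketch below follows that reading and ignores edge labels. The function name `k_merge` and the example mapping are assumptions made for illustration.

```python
def k_merge(edges, mu, k):
    """Collapse vertices according to merge mapping mu and return the merged graph.

    edges: set of (u, v) pairs of the original graph G
    mu:    dict mapping each vertex of V1 | V2 to its merged vertex
    k:     required number of merged vertices
    """
    merged_vertices = set(mu.values())
    if len(merged_vertices) != k:
        raise ValueError("merge mapping must produce exactly k vertices")
    # An edge between merged vertices exists iff some represented pair is an edge in G.
    merged_edges = {(mu[u], mu[v]) for u, v in edges if u in mu and v in mu}
    return merged_vertices, merged_edges

edges = {("a", "b"), ("b", "c"), ("c", "d")}
mu = {"a": "m1", "b": "m1", "c": "m2", "d": "m2"}
V_m, E_m = k_merge(edges, mu, k=2)
```

Under this reading, edges whose endpoints collapse into the same merged vertex show up as self-loops. A real index might drop them.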

13 DOGMA index A DOGMA index of order k (k ≥ 2) is a binary tree D with the following properties:
1. Each node in D equals the size of a disk page and is labeled by a graph.
2. The labels of the set of leaf nodes of D constitute a partition of $G_R$.
3. If node N is the parent of nodes $N_1, N_2$, then the graph labeling node N is a k-merge of the graphs labeling its children.
4. D is balanced.

14

15

16 DOGMA index Many different DOGMA indexes can be constructed for the same RDF database. We want to find a DOGMA index with as few “cross” edges between sub-graphs stored on different pages as possible

17 DOGMA index To build the DOGMA index, the authors used an external graph partitioning algorithm that minimizes edge crossings (the GGGP graph partitioning algorithm). Given a weighted graph, it partitions the vertex set into two parts such that: the total weight of all edges crossing the partition is minimized; the accumulated vertex weights are (approximately) equal for both parts.
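To give a feel for the two objectives (small edge cut, balanced sides), here is a simplified greedy-growing bisection in Python. It is only a sketch of the general idea, not the GGGP implementation the authors used. It assumes unit vertex weights and a fixed seed vertex, and the function names are hypothetical.

```python
def edge_cut(edges, part):
    """Total weight of edges with exactly one endpoint inside part."""
    return sum(w for (u, v), w in edges.items() if (u in part) != (v in part))

def greedy_grow(vertices, edges, seed):
    """Grow one side of a bisection from seed until it holds half the vertices."""
    adj = {x: {} for x in vertices}
    for (u, v), w in edges.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    part = {seed}
    while len(part) < len(vertices) // 2:
        def gain(x):
            # Reduction of the cut if x joins part: edges into part become
            # internal, edges to the other side become cut edges.
            inside = sum(w for n, w in adj[x].items() if n in part)
            return 2 * inside - sum(adj[x].values())
        outside = sorted(x for x in vertices if x not in part)
        frontier = [x for x in outside if any(n in part for n in adj[x])]
        part.add(max(frontier or outside, key=gain))
    return part

# Two triangles joined by a single edge (3, 4), all weights 1.
vertices = {1, 2, 3, 4, 5, 6}
edges = {(1, 2): 1, (2, 3): 1, (1, 3): 1, (4, 5): 1, (5, 6): 1, (4, 6): 1, (3, 4): 1}
side = greedy_grow(vertices, edges, seed=1)
cut = edge_cut(edges, side)
```

On this toy graph the grown side is one of the triangles, so only the bridging edge crosses the partition.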

18

19 Basic Algorithm Assume the existence of two index retrieval functions:
retrieveNeighbors(D,v,l), which retrieves from DOGMA index D the neighbors of v restricted to label l;
retrieveVertex(D,v), which retrieves from D a complete description of vertex v.

20 Basic Algorithm DOGMA_basic is recursive.
For each variable vertex v in Q, the algorithm maintains a set $R_v$ of constant vertices (called result candidates). If v is a neighbor of a constant vertex c in Q through an edge labeled l, then any answer substitution θ must map v to a neighbor of c in $G_R$ through an edge labeled l. Therefore $R_v$ is initialized to the set of all neighbors of c in $G_R$ reachable by an edge labeled l.

21 Basic Algorithm We use the DOGMA index D to efficiently retrieve the neighbors of c. If v is connected to multiple constant vertices, we take the intersection of the respective constraints on the result candidates

22 Basic Algorithm Greedily choose the variable vertex w with the smallest set of result candidates. If that set is empty, then we know that θ cannot be extended to an answer substitution. Otherwise, we derive extended substitutions θ' from θ, each assigning one of the candidates to w, and call DOGMA_basic recursively on each θ'.
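The recursive scheme can be sketched in Python over an in-memory set of labeled edges. This is a simplified illustration under stated assumptions, not the authors' implementation. `retrieve_neighbors` is a hypothetical in-memory stand-in for retrieveNeighbors(D,v,l), candidates are recomputed on every call instead of being maintained incrementally, variables are strings starting with "?", and the toy data is made up.

```python
def is_var(x):
    return x.startswith("?")

def retrieve_neighbors(db, vertex, label, outgoing):
    """Hypothetical in-memory stand-in for retrieveNeighbors(D, v, l)."""
    if outgoing:
        return {o for s, l, o in db if s == vertex and l == label}
    return {s for s, l, o in db if o == vertex and l == label}

def candidates(db, query, theta):
    """Result candidates for unassigned variables adjacent to a bound vertex."""
    cands = {}
    for u, label, v in query:
        su, sv = theta.get(u, u), theta.get(v, v)
        if is_var(su) and not is_var(sv):
            found = retrieve_neighbors(db, sv, label, outgoing=False)
            cands[su] = cands[su] & found if su in cands else found
        elif is_var(sv) and not is_var(su):
            found = retrieve_neighbors(db, su, label, outgoing=True)
            cands[sv] = cands[sv] & found if sv in cands else found
    return cands

def dogma_basic(db, query, theta=None):
    theta = theta or {}
    unassigned = {x for u, _, v in query for x in (u, v) if is_var(x)} - set(theta)
    if not unassigned:
        grounded = {(theta[u] if is_var(u) else u, l, theta[v] if is_var(v) else v)
                    for u, l, v in query}
        return [theta] if grounded <= db else []
    cands = candidates(db, query, theta)
    if not cands:
        return []  # sketch only handles connected queries with a constant vertex
    w = min(cands, key=lambda x: (len(cands[x]), x))  # smallest candidate set
    answers = []
    for c in sorted(cands[w]):  # empty candidate set: no extension possible
        answers.extend(dogma_basic(db, query, {**theta, w: c}))
    return answers

db = {
    ("Bill B0045", "subject", "Health Care"),
    ("Bill B0045", "sponsor", "Jeff Ryser"),
    ("Jeff Ryser", "gender", "Male"),
    ("Amendment A0056", "amendmentTo", "Bill B0045"),
    ("Amendment A0056", "sponsor", "Carla Bunes"),
    ("Bill B0532", "subject", "Health Care"),
    ("Bill B0532", "sponsor", "Pierce Dickes"),
    ("Pierce Dickes", "gender", "Male"),
}
query = {
    ("?v1", "sponsor", "Carla Bunes"),
    ("?v1", "amendmentTo", "?v2"),
    ("?v2", "subject", "Health Care"),
    ("?v2", "sponsor", "?v3"),
    ("?v3", "gender", "Male"),
}
answers = dogma_basic(db, query)
```

In this toy run the algorithm binds ?v1 first because it has a single candidate, which in turn narrows ?v2 and ?v3 to one candidate each.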

23 Example

24 Basic Algorithm The worst-case complexity of the DOGMA_basic algorithm is $O(|V_R|^{|V_Q \cap \mathrm{VAR}|})$: the algorithm is exponential in the number of variables in the query.

25

26 Basic Algorithm How can we improve the algorithm's performance?
Instead of using only "short range" dependencies (constraints between directly connected vertices), also use "long range" dependencies.

27 Advanced Algorithm For every variable v, the algorithm maintains a set $C_v$ that contains all distance constraints arising from long range dependencies. After creating $R_v$, the algorithm removes all the candidates in $R_v$ that don't satisfy the distance constraints in $C_v$.

28 Advanced Algorithm Using distance constraints reduces the size of $R_v$ (by removing candidates from it) and hence the number of extensions to θ we have to consider. The improved algorithm (DOGMA_adv) assumes the existence of a distance index that efficiently looks up the length of the shortest path between two vertices u, v.

29 Advanced Algorithm But how can we obtain this information?
Computing graph distances at query time is clearly inefficient. Let's extend the DOGMA index to include information about distances.

30 Long Range Dependencies
We don't have to know the exact distance between two vertices u, v. The distance between Bill B0744 and Health Care must be at most 2. We do not need their exact distance, though: it is enough to know the minimum distance from Bill B0744 to any vertex outside the group (page) that contains it. If that distance is greater than 2, Bill B0744 can be removed from the candidate list.

31 DOGMA_ipd (Internal Partition Distance)
DOGMA_ipd provides this information. For every vertex v and every index node N that contains v, DOGMA_ipd stores the distance from v to the "outside world", i.e. to the closest vertex outside N.

32 DOGMA_ipd At query time we want a bound on the distance from v to u. We identify the "splitting vertices" in the index tree (the two sibling nodes where the paths from the root to v and to u diverge). We take the max of the two stored minimum distances.

33 Example N1 and N2 are the "splitting vertices".
The minimal distance from "A0056" to some v in the leaves of N2 is 3. The minimal distance from "Health Care" to some v in the leaves of N1 is 2. Max(3,2) = 3 is a lower bound on their distance; it exceeds the allowed distance, and therefore we can remove A0056.
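The lookup in this example can be sketched as follows. The sketch assumes the index stores, for each vertex and each index node on its root-to-leaf path, the minimum distance to any vertex outside that node. The dictionary layout, the function names, the leaf names, and the allowed distance of 2 are assumptions made for illustration.

```python
def splitting_nodes(path_u, path_v):
    """Return the first pair of differing index nodes on two root-to-leaf paths."""
    for n_u, n_v in zip(path_u, path_v):
        if n_u != n_v:
            return n_u, n_v
    return None  # both vertices live in the same leaf page

def ipd_lower_bound(ipd, u, path_u, v, path_v):
    """Lower bound on dist(u, v): each vertex must first leave its own side."""
    split = splitting_nodes(path_u, path_v)
    if split is None:
        return 0  # no information, the bound is trivial
    n_u, n_v = split
    return max(ipd[(u, n_u)], ipd[(v, n_v)])

# Values from the example: A0056 sits under N1, Health Care under N2.
ipd = {("A0056", "N1"): 3, ("Health Care", "N2"): 2}
bound = ipd_lower_bound(
    ipd,
    "A0056", ["root", "N1", "leaf_a"],
    "Health Care", ["root", "N2", "leaf_b"],
)
max_allowed = 2  # assumed distance constraint from the query
prune = bound > max_allowed
```

Because any path between the two vertices has to leave both N1 and N2, the larger of the two stored distances is a valid lower bound, and the candidate can be discarded without computing the exact distance.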

34

35 DOGMA_epd (External Partition Distance)
In DOGMA_ipd we store for each vertex the minimal distance to the "outside world". In DOGMA_epd we store for each vertex the minimal distance to each "country" in the "outside world". Every vertex is assigned a color, and we look up the minimal distance between u and the closest vertex of the relevant color. The number of colors is finite; the authors used an algorithm that makes the most of the available colors by giving the same color to vertices that are very far apart (and therefore unlikely to meet).
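A possible reading of the color-based bound, sketched in Python: if the index stores for vertex u the minimum distance to any outside vertex of each color, then the entry for v's color bounds dist(u, v) from below. The data layout, names, colors, and numbers here are assumptions for illustration only.

```python
def epd_lower_bound(epd, color, u, n_u, v, n_v):
    """Lower bound on dist(u, v) using per-color outside distances.

    epd[(x, N, c)] is assumed to hold the minimum distance from x to any
    vertex outside index node N that has color c.
    """
    return max(epd[(u, n_u, color[v])], epd[(v, n_v, color[u])])

def ipd_from_epd(epd, vertex, node, colors):
    """The ipd value is the minimum of the per-color epd values."""
    return min(epd[(vertex, node, c)] for c in colors)

colors = ["red", "blue"]
color = {"A0056": "red", "Health Care": "blue"}
epd = {
    ("A0056", "N1", "red"): 3,
    ("A0056", "N1", "blue"): 5,
    ("Health Care", "N2", "red"): 4,
    ("Health Care", "N2", "blue"): 2,
}
epd_bound = epd_lower_bound(epd, color, "A0056", "N1", "Health Care", "N2")
ipd_bound = max(
    ipd_from_epd(epd, "A0056", "N1", colors),
    ipd_from_epd(epd, "Health Care", "N2", colors),
)
```

With these toy numbers the color-aware bound is tighter than the ipd bound, at the cost of storing one distance per color instead of one per vertex and node.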

36 Experimental Results They built a system that implements the DOGMA_adv algorithm combined with the DOGMA_ipd and DOGMA_epd indexes. They compared the performance of their algorithm and indexes with 4 leading RDF database systems: Sesame2, Jena2, JenaTDB, and OWLIM. OWLIM loads the entire database into main memory before evaluating a query, so it would seem to have an advantage over the other systems. Jena2 is Java-based.

37 Experimental Results They use three different RDF datasets:
Flickr social network: 16 million triples, well connected.
GovTrack: 14.5 million triples.
The Lehigh University Benchmark (LUBM): 13.5 million triples, automatically generated, contains small connected subgraphs.

38 Experimental Results They designed a set of graph queries with varying complexity. Constant vertices were chosen randomly. Queries with an empty result set were filtered out. Queries were grouped into classes based on the number of edges and variable vertices. They averaged the query times of all queries in each class.

39 Experimental Results They ran the tests on a machine with a 2.4 GHz Intel Core 2 processor and 3 GB of RAM. In a first round of experiments, they designed several relatively simple graph queries, containing no more than 6 edges.

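40 The time scale is logarithmic. Blank entries mean the system did not finish in reasonable time. OWLIM is slow because it loads the whole database into memory before the query. On the social network and GovTrack datasets, the gap over the competitors grows as query complexity increases. Sesame2 performs well on LUBM.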

41 Experimental Results In the second round of experiments, they increased the complexity of the queries to up to 24 edges. The OWLIM, JenaTDB, and Jena2 systems did not manage to complete the evaluation of these queries in reasonable time. On the GovTrack and social network datasets, DOGMA_ipd and DOGMA_epd outperform Sesame2 by up to 40,000%.

42

43 Results – storage requirements

44 Future work Study of the advantages and disadvantages of each of the proposed indexes when dealing with particular queries and RDF datasets. Extend the indexes to support efficient updates.

45 Conclusion We saw the DOGMA index and the algorithms that use this index to answer graph queries efficiently.

46 Thank you! Good luck with your exams!

