DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases

Presentation transcript:

DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases. Matthias Bröcheler, Andrea Pugliese, and V.S. Subrahmanian. University of Maryland, USA; Università della Calabria, Italy. Yuval Gil

Introduction; DOGMA index; Basic algorithm for graph queries; Advanced algorithm for graph queries; DOGMA_ipd and DOGMA_epd; Experimental results; Future work; Conclusion

Introduction: RDF is becoming an increasingly important paradigm for Web knowledge representation. We need to store massive RDF datasets and query them efficiently. More and more RDF datasets are represented as graphs.

Introduction

Formality: An RDF triple has the form (subject, property, value). An RDF database R is a finite set of RDF triples.

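Formality (2): The RDF graph of R is a labeled graph $G_R = (V_R, E_R, \lambda_R)$ where $V_R$ is the set of subjects and values occurring in R, $E_R \subseteq V_R \times V_R$, and $\lambda_R$ is a mapping from edges to properties s.t. for all $(s, p, v) \in R$ there is an edge $e = (s, v) \in E_R$ with $\lambda_R(e) = p$.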

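Graph Queries: We assume the existence of some set VAR of variable symbols, each starting with “?”. A graph query is any graph $Q = (V_Q, E_Q, \lambda_Q)$ where the vertices in $V_Q$ are subjects, values, or variables from VAR, $E_Q \subseteq V_Q \times V_Q$, and $\lambda_Q$ labels each edge with a property.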

RDF Example: A bill whose subject is health care, whose sponsor is male, and which has an amendment by Carla Bunes.

Substitution: A substitution for query Q is a mapping θ that maps every variable vertex in Q to either a subject or a value. If θ is a substitution for query Q, then Qθ denotes the graph obtained from Q by replacing every variable ?v with θ(?v).

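Answer: A substitution θ is an answer for query Q with respect to database R iff Qθ is a subgraph of the RDF graph $G_R$ of R. The answer set for query Q w.r.t. an RDF database R is the set of all such answers, $\{\theta \mid Q\theta \text{ is a subgraph of } G_R\}$.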
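The definitions of substitution and answer can be illustrated with a minimal sketch. It assumes an RDF database is represented as a Python set of (subject, property, value) triples and a query as a set of triples whose variables start with “?”. The function names and the toy data are hypothetical, not taken from the presentation.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


def is_var(term: str) -> bool:
    """Variables are the symbols that start with '?'."""
    return term.startswith("?")


def apply_substitution(query: Set[Triple], theta: Dict[str, str]) -> Set[Triple]:
    """Return Q-theta: every variable ?v in the query is replaced by theta(?v)."""
    def sub(term: str) -> str:
        return theta.get(term, term) if is_var(term) else term
    return {(sub(s), p, sub(v)) for (s, p, v) in query}


def is_answer(query: Set[Triple], theta: Dict[str, str], database: Set[Triple]) -> bool:
    """theta is an answer iff Q-theta has no variables left and all its edges are in R."""
    grounded = apply_substitution(query, theta)
    fully_bound = all(not is_var(s) and not is_var(v) for (s, _, v) in grounded)
    return fully_bound and grounded <= database


# Hypothetical toy database, loosely modeled on the slide's example.
database = {
    ("Peter Traves", "gender", "Male"),
    ("Bill B0054", "subject", "Health Care"),
}
query = {("?v1", "gender", "Male")}

print(is_answer(query, {"?v1": "Peter Traves"}, database))  # a substitution that is an answer
print(is_answer(query, {"?v1": "Bill B0054"}, database))    # a substitution that is not
```

Treating the subgraph test as a subset test on triples is a simplification that works because each triple corresponds to one labeled edge.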

Example: Query: ?v1 --Gender--> Male. Substitution: Bill B0054 --Gender--> Male. Answer: Peter Traves --Gender--> Male. Answer Set = {Peter Traves, Jeff Ryser, Pierce Dickes, Keith Farmer}

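K-merge Graph: Suppose $G = (V, E, \lambda)$ is an RDF graph, $G_1 = (V_1, E_1, \lambda_1)$ and $G_2 = (V_2, E_2, \lambda_2)$ are two RDF graphs s.t. $V_1, V_2 \subseteq V$, and k is an integer s.t. $k \le \max(|V_1|, |V_2|)$. Then graph $G_m = (V_m, E_m, \lambda_m)$ is a k-merge of graphs $G_1, G_2$ iff: 1) $|V_m| = k$. 2) There is a surjective mapping $\mu : V_1 \cup V_2 \to V_m$, called the merge mapping, such that for all $v \in V_m$, $rep(v) = \{v' \in V_1 \cup V_2 \mid \mu(v') = v\}$, and $(v_1, v_2) \in E_m$ iff there exist $v_1' \in rep(v_1)$ and $v_2' \in rep(v_2)$ with $(v_1', v_2') \in E$.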
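As a rough illustration of the merge mapping, the sketch below builds the merged vertex and edge sets from a given mapping. It assumes the edge condition reads "two merged vertices are connected iff some pair of original vertices they represent is connected in G"; edge labels are ignored, and the function and variable names are hypothetical.

```python
from typing import Dict, Set, Tuple


def merge_graph(edges: Set[Tuple[str, str]], mu: Dict[str, str]):
    """Apply a merge mapping mu (original vertex -> merged vertex) to the edges of G.

    An edge (m1, m2) appears in the merged graph iff some original edge (u, v)
    exists with mu[u] == m1 and mu[v] == m2. Edges touching vertices outside
    the domain of mu are skipped. Under this reading, an original edge between
    two vertices with the same image yields a self-loop.
    """
    merged_vertices = set(mu.values())
    merged_edges = {(mu[u], mu[v]) for (u, v) in edges if u in mu and v in mu}
    return merged_vertices, merged_edges


# Toy graph a -> b -> c -> d, merged into k = 2 vertices.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
mu = {"a": "m1", "b": "m1", "c": "m2", "d": "m2"}
vertices_m, edges_m = merge_graph(edges, mu)
print(len(vertices_m), sorted(edges_m))
```

Here the only edge between distinct merged vertices is (m1, m2), which comes from the original edge (b, c).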

DOGMA index: A DOGMA index of order k (k ≥ 2) is a binary tree D with the following properties:
1. Each node in D equals the size of a disk page and is labeled by a graph.
2. The labels of the set of leaf nodes of D constitute a partition of the RDF graph $G_R$.
3. If node N is the parent of nodes $N_1, N_2$, then the graph labeling node N is a k-merge of the graphs labeling its children.
4. D is balanced.

DOGMA index: Many different DOGMA indexes can be constructed for the same RDF database. We want to find one with as few “cross” edges as possible between subgraphs stored on different pages.

DOGMA index: To build the DOGMA index, the authors used an external graph partitioning algorithm that minimizes edge crossings (the GGGP graph partitioning algorithm). Given a weighted graph, it partitions the vertex set in such a way that: (1) the total weight of all edges crossing the partition is minimized; (2) the accumulated vertex weights are (approximately) equal for both partitions.
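The two partitioning objectives can be made concrete with a small sketch that only evaluates a given bipartition; it does not implement GGGP itself. The graph representation (dicts of edge and vertex weights) and the function names are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def cut_weight(edge_weights: Dict[Tuple[str, str], float], part_a: Set[str]) -> float:
    """Total weight of edges with exactly one endpoint in part_a (the crossing edges)."""
    return sum(w for (u, v), w in edge_weights.items() if (u in part_a) != (v in part_a))


def partition_weights(vertex_weights: Dict[str, float], part_a: Set[str]) -> Tuple[float, float]:
    """Accumulated vertex weight of part_a and of its complement."""
    in_a = sum(w for v, w in vertex_weights.items() if v in part_a)
    return in_a, sum(vertex_weights.values()) - in_a


# Toy weighted graph: a 4-cycle with one heavy edge between b and c.
edge_weights = {("a", "b"): 1, ("b", "c"): 5, ("c", "d"): 1, ("a", "d"): 1}
vertex_weights = {"a": 1, "b": 1, "c": 1, "d": 1}

# Both bipartitions are balanced, but the second cuts far less edge weight.
print(cut_weight(edge_weights, {"a", "b"}), partition_weights(vertex_weights, {"a", "b"}))
print(cut_weight(edge_weights, {"b", "c"}), partition_weights(vertex_weights, {"b", "c"}))
```

A partitioner with the stated objectives would prefer the second split, since it keeps the heavy edge inside one part.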

Basic Algorithm: We assume the existence of two index retrieval functions: retrieveNeighbors(D,v,l), which retrieves from DOGMA index D the neighbors of v restricted to label l, and retrieveVertex(D,v), which retrieves from D a complete description of vertex v.

Basic Algorithm: DOGMA_basic is recursive. For each variable vertex v in Q, the algorithm maintains a set $R_v$ of constant vertices (called result candidates). If v is connected in Q to a constant vertex c through an edge labeled l, then any answer substitution θ must be such that θ(v) is a neighbor of c through an edge labeled l. Therefore $R_v$ is the set of all neighbors of c in $G_R$ reachable by an edge labeled l.

Basic Algorithm: We use the DOGMA index D to efficiently retrieve the neighbors of c. If v is connected to multiple constant vertices, we take the intersection of the respective sets of result candidates.

Basic Algorithm: Greedily choose the variable vertex w with the smallest set of result candidates. If the set of result candidates is empty, then we know that θ cannot be extended to an answer substitution. Otherwise, for each result candidate m of w, we derive an extended substitution θ’ from θ that assigns m to w, and call DOGMA_basic recursively on θ’.
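The steps above can be sketched as a small recursive matcher. This is a simplified illustration, not the authors' implementation: the disk-based index is replaced by in-memory dictionaries, `retrieve_neighbors` takes an extra hypothetical `outgoing` flag to handle edge direction, candidate sets are recomputed on every call instead of being maintained incrementally, and the query is assumed to be connected and to contain at least one constant vertex. The toy data is invented.

```python
from collections import defaultdict


def is_var(term):
    return term.startswith("?")


def build_index(triples):
    """In-memory stand-in for the DOGMA index: neighbor sets keyed by (vertex, label)."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for s, l, v in triples:
        out_nbrs[(s, l)].add(v)
        in_nbrs[(v, l)].add(s)
    return out_nbrs, in_nbrs


def retrieve_neighbors(index, vertex, label, outgoing):
    out_nbrs, in_nbrs = index
    return set((out_nbrs if outgoing else in_nbrs).get((vertex, label), ()))


def candidates_for(index, query, theta, var):
    """Intersect the neighbor sets implied by each edge linking var to a bound vertex."""
    result = None  # None means no constraint has been seen yet
    for s, l, t in query:
        if s == var and t != var:
            other, outgoing = theta.get(t, t), False  # var is the source of the edge
        elif t == var and s != var:
            other, outgoing = theta.get(s, s), True   # var is the target of the edge
        else:
            continue
        if is_var(other):
            continue  # the other endpoint is still an unassigned variable
        nbrs = retrieve_neighbors(index, other, l, outgoing)
        result = nbrs if result is None else result & nbrs
    return result


def dogma_basic(index, query, theta=None):
    theta = theta or {}
    variables = {x for s, _, t in query for x in (s, t) if is_var(x)}
    unassigned = variables - set(theta)
    if not unassigned:
        # Every variable is bound: verify all query edges against the data.
        ok = all(theta.get(t, t) in retrieve_neighbors(index, theta.get(s, s), l, True)
                 for s, l, t in query)
        return [dict(theta)] if ok else []
    cands = {w: candidates_for(index, query, theta, w) for w in unassigned}
    constrained = {w: c for w, c in cands.items() if c is not None}
    if not constrained:
        raise ValueError("query must be connected and contain a constant vertex")
    # Greedy choice: the variable with the fewest result candidates.
    w = min(constrained, key=lambda x: (len(constrained[x]), x))
    answers = []
    for m in sorted(constrained[w]):  # an empty candidate set yields no answers
        answers.extend(dogma_basic(index, query, {**theta, w: m}))
    return answers


triples = {
    ("Bill B0045", "subject", "Health Care"), ("Bill B0532", "subject", "Health Care"),
    ("Jeff Ryser", "sponsor", "Bill B0045"), ("Keith Farmer", "sponsor", "Bill B0532"),
    ("Jeff Ryser", "gender", "Male"), ("Keith Farmer", "gender", "Male"),
    ("Amendment A0342", "amendmentTo", "Bill B0045"),
    ("Carla Bunes", "sponsor", "Amendment A0342"),
}
query = {
    ("?v1", "subject", "Health Care"), ("?v3", "sponsor", "?v1"), ("?v3", "gender", "Male"),
    ("?v2", "amendmentTo", "?v1"), ("Carla Bunes", "sponsor", "?v2"),
}
print(dogma_basic(build_index(triples), query))
```

On this toy data the matcher binds ?v2 first, because it has a single candidate, which immediately narrows ?v1 to one bill.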

Example

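Basic Algorithm: The worst-case complexity of the DOGMA_basic algorithm is $O(|V_R|^{|V_Q \cap \mathrm{VAR}|})$, where $V_R$ is the vertex set of the RDF graph and $V_Q \cap \mathrm{VAR}$ is the set of variable vertices of Q. That is, the algorithm is exponential in the number of variables in the query.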

Basic Algorithm: How can we improve this algorithm's performance? Instead of using only “short range” dependencies, also use “long range” dependencies.

Advanced Algorithm: For every variable v, the algorithm maintains a set $C_v$ that contains all distance constraints arising from long range dependencies. After creating the set of result candidates $R_v$, the algorithm removes all the candidates in $R_v$ that don’t satisfy the distance constraints in $C_v$.

Advanced Algorithm: Using distance constraints reduces the size of the result candidate sets (by removing candidates from them) and hence the number of extensions to θ we have to consider. The improved algorithm (DOGMA_adv) assumes the existence of a distance index to efficiently look up the length of the shortest path between two vertices u and v.
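The pruning step can be sketched as a filter over a candidate set. The sketch assumes each distance constraint is a pair (target vertex, maximum distance) and that some function returns a lower bound on the true graph distance; both the representation and the name `lower_bound` are hypothetical. A candidate is removed only when the lower bound already exceeds the allowed distance, so no valid answer is lost.

```python
from typing import Callable, Iterable, Set, Tuple


def prune_candidates(
    candidates: Set[str],
    constraints: Iterable[Tuple[str, int]],
    lower_bound: Callable[[str, str], int],
) -> Set[str]:
    """Keep candidate c only if lower_bound(c, target) <= max_dist for every constraint."""
    constraints = list(constraints)
    return {
        c for c in candidates
        if all(lower_bound(c, target) <= max_dist for target, max_dist in constraints)
    }


# Hypothetical lower bounds, as a distance index might report them.
bounds = {
    ("Amendment A0056", "Health Care"): 3,
    ("Amendment A0342", "Health Care"): 2,
}
kept = prune_candidates(
    {"Amendment A0056", "Amendment A0342"},
    [("Health Care", 2)],
    lambda u, v: bounds[(u, v)],
)
print(kept)
```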

Advanced Algorithm: But how can we obtain this information? Computing graph distances at query time is clearly inefficient. Let's extend the DOGMA index to include information about distances.

Long Range Dependencies: We don’t have to know the exact distance between two vertices u and v. The distance between Bill B0744 and Health Care must be at most 2. We do not need to know their exact distance, however; it is enough to know the minimal distance from Bill B0744 to some vertex that is not in the group (page) containing Bill B0744. If that distance is greater than 2, then Bill B0744 can be removed from the list.

DOGMA_ipd (Internal Partition Distance): DOGMA_ipd provides this information. For every vertex v and index node N, DOGMA_ipd stores the distance from v to the “outside world”, i.e. to the nearest vertex outside the subgraph under N.

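DOGMA_ipd: At query time we want to know the distance from v to u. We identify the “splitting vertices” in the index tree, i.e. the two index nodes at which the paths from the root to v and to u diverge. We take the max of the two minimum distances, which is a lower bound on the distance from v to u.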

Example: N1 and N2 are the “splitting vertices”. The minimal distance from “A0056” to some v in the leaves of N2 is 3. The minimal distance from “Health Care” to some v in the leaves of N1 is 2. Max(3,2) = 3, which exceeds the distance allowed by the constraint, and therefore we can remove A0056.
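The lookup in this example can be sketched as follows. The sketch assumes each vertex comes with the list of index nodes on the path from the root to its leaf, and that stored distances are kept in a dictionary keyed by (vertex, index node); the node names `root`, `leaf_a`, `leaf_b` and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def ipd_lower_bound(
    u: str, path_u: List[str],
    v: str, path_v: List[str],
    ipd: Dict[Tuple[str, str], int],
) -> int:
    """Lower bound on dist(u, v) from internal partition distances.

    Walk both root-to-leaf paths until they diverge; the two diverging index
    nodes are the splitting nodes. Any path from u to v must leave u's side
    and enter v's side, so the larger of the two stored distances is a bound.
    """
    for n_u, n_v in zip(path_u, path_v):
        if n_u != n_v:
            return max(ipd[(u, n_u)], ipd[(v, n_v)])
    return 0  # same leaf: the index gives no information


# Values from the slide's example: A0056 lies under N1, Health Care under N2.
ipd = {("A0056", "N1"): 3, ("Health Care", "N2"): 2}
bound = ipd_lower_bound(
    "A0056", ["root", "N1", "leaf_a"],
    "Health Care", ["root", "N2", "leaf_b"],
    ipd,
)
print(bound)
```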

DOGMA_epd (External Partition Distance): In DOGMA_ipd we store for each vertex the minimal distance to the “outside world”. In DOGMA_epd we store for each vertex the minimal distance to each “country” in the “outside world”. By coloring every vertex, we look for the minimal distance between u and the nearest vertex of the same color. The number of colors is finite; the authors used an algorithm that maximizes the number of colors that can be used, by giving the same color to vertices that are very far apart (so there is a low probability that they will meet).

Experimental Results: The authors built a system that implements the DOGMA_adv algorithm combined with the DOGMA_ipd and DOGMA_epd indexes. They compared the performance of their algorithm and indexes with 4 leading RDF database systems: Sesame2, Jena2, JenaTDB, and OWLIM. OWLIM loads the entire database into main memory before evaluating the query, and therefore ostensibly has an advantage over the other systems. Jena2 is Java-based.

Experimental Results: They used three different RDF datasets:
Flickr social network: 16 million triples, well connected.
GovTrack: 14.5 million triples.
The Lehigh University Benchmark (LUBM): 13.5 million triples, contains small connected subgraphs, auto generated.

Experimental Results: They designed a set of graph queries with varying complexity. Constant vertices were chosen randomly. Queries with an empty result set were filtered out. Queries were grouped into classes based on the number of edges and variable vertices. They averaged the query times of all queries in each class.

Experimental Results: They tested the system on a machine with a 2.4 GHz Intel Core 2 processor and 3 GB of RAM. In a first round of experiments, they designed several relatively simple graph queries, containing no more than 6 edges.

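The time scale is logarithmic. Empty places mean the system did not finish in reasonable time. OWLIM is slow because it loads the entire database into memory before the query. On the social network and GovTrack datasets, the gaps relative to the competitors grow as the complexity increases. Sesame2 does well on LUBM.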

Experimental Results: In the second round of experiments, they increased the complexity of the queries to up to 24 edges. The OWLIM, JenaTDB, and Jena2 systems did not manage to complete the evaluation of these queries in reasonable time. On the GovTrack and social network datasets, DOGMA_ipd and DOGMA_epd outperform Sesame2 by up to 40,000%.

Results – storage requirements

Future work Study of the advantages and disadvantages of each of the proposed indexes when dealing with particular queries and RDF datasets. Extend the indexes to support efficient updates.

Conclusion: We saw the DOGMA index and algorithms that use this index to answer graph queries efficiently.

Thank You! Good luck with your exams!