Efficiently Answering Reachability Queries on Large Directed Graphs Ruoming Jin Kent State University Joint work with Yang Xiang (KSU), Ning Ruan (KSU), and Haixun Wang (IBM T.J. Watson)
Reachability Query. The problem: given two vertices u and v in a directed graph G, is there a path from u to v? Examples from the figure: Query(1,11): yes; Query(3,9): no. The directed graph is reduced to a DAG (directed acyclic graph) by coalescing its strongly connected components.
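As a baseline for the problem statement above, here is a minimal sketch of answering one reachability query by breadth-first search, which takes O(n+m) per query with no index. The function name `reachable` and the toy graph are illustrative assumptions; the graph is not the one drawn on the slide.

```python
from collections import deque

def reachable(adj, u, v):
    """Return True if a directed path leads from u to v (plain BFS)."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical toy graph (adjacency lists), not the slide's example.
toy = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
print(reachable(toy, 1, 4))  # True
print(reachable(toy, 4, 1))  # False
```

Every index discussed in the rest of the talk trades construction time and space against this per-query traversal cost.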
Applications: XML; biological networks; ontology; knowledge representation (lattice operations); object programming (class relationships); distributed systems (reachable states); graph databases.
Prior Work
Method                                          Query time  Construction  Index size
DFS/BFS                                         O(n+m)
Transitive Closure                              O(1)        O(nm)/O(n^3)  O(n^2)
Optimal Chain Cover (Jagadish, TODS’90)         O(k)        O(nm)         O(nk)
Optimal Tree Cover (Agrawal et al., SIGMOD’89)  O(n)        O(nm)         O(n^2)
Dual-Labeling (Wang et al., ICDE’06)            O(1)        O(n+m+t^3)    O(n+t^2)
Labeling+SSPI (Chen et al., VLDB’05)            O(m-n)      O(n+m)
GRIPP (Trißl et al., SIGMOD’07)                 O(m-n)      O(n+m)
Also: 2-HOP (O(nm^(1/2)) and O(n^4)), HOPI, and heuristic algorithms.
Limitations of Tree-based Approaches: Finding a good tree cover is expensive. A tree cover cannot represent some common types of DAGs, such as grids. Compression limitations: –Chain (1 parent, 1 child) –Tree (1 parent, multiple children) –Most existing methods that rely on a tree cover are strongly affected by how many edges are left uncovered.
Overview of Path-Tree: Chain -> Tree -> Path-Tree (2 parents / multiple children). A path-tree cover is a spanning subgraph of G in the shape of a tree T. A node in T corresponds to a path in G, and an edge in T corresponds to the edges between two paths in G. A 3-tuple labeling exists for any path-tree and answers reachability queries in O(1).
Path-Tree in a Nutshell (figure: paths P1, P2, P3, P4). The path-graph is not necessarily a planar graph. Reachability between any two nodes can be answered in O(1).
Key Problems How to construct a path-tree? –Algorithm How can a path-tree help with reachability queries? –Labeling –Transitive Closure Compression How does path-tree compare with the existing methods? –Optimality
Constructing Path-Tree Step 1: Path-Decomposition of DAG Step 2: Minimal Equivalent Edge Set between any two paths Step 3: Path-Graph Construction Step 4: Path-Tree Cover Extraction
Step 1: Path-Decomposition (figure: paths P1 to P4; each vertex carries a label (PID, SID), e.g. (2, 5)). For any two nodes u and v in the same path, u reaches v if and only if u.sid ≤ v.sid. A simple linear-time algorithm based on topological sort can produce a path-decomposition.
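The slide does not spell out the decomposition algorithm, so the following is only a sketch of one greedy variant built on a topological sort: walk the vertices in topological order, start a new path at each vertex not yet assigned, and extend it along unassigned successors. The function names, the tie-breaking rule (earliest successor in topological order), and the toy DAG are assumptions made for illustration.

```python
from collections import deque

def topological_order(adj):
    """Kahn's algorithm on a DAG given as node -> list of successors."""
    nodes = set(adj)
    for succs in adj.values():
        nodes.update(succs)
    indeg = {v: 0 for v in nodes}
    for succs in adj.values():
        for w in succs:
            indeg[w] += 1
    queue = deque(sorted(v for v in nodes if indeg[v] == 0))
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj.get(v, ()):
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

def path_decomposition(adj):
    """Assign every vertex a (pid, sid) pair; consecutive sids share an edge."""
    order = topological_order(adj)
    pos = {v: i for i, v in enumerate(order)}
    label = {}
    pid = 0
    for v in order:
        if v in label:
            continue
        pid += 1
        sid, cur = 1, v
        while cur is not None:
            label[cur] = (pid, sid)
            sid += 1
            free = [w for w in adj.get(cur, ()) if w not in label]
            # Assumed tie-break: extend with the topologically earliest successor.
            cur = min(free, key=pos.__getitem__) if free else None
    return label

def same_path_reaches(label, u, v):
    """Same-path test from the slide; None when u and v lie on different paths."""
    if label[u][0] != label[v][0]:
        return None
    return label[u][1] <= label[v][1]

dag = {1: [2, 3], 2: [4], 3: [4], 4: []}   # hypothetical toy DAG
labels = path_decomposition(dag)
print(labels)
print(same_path_reaches(labels, 1, 4))  # True
```

On the toy DAG this yields two paths, 1 -> 2 -> 4 and the single-vertex path 3; the edge 3 -> 4 is a cross-path edge of the kind Step 2 deals with.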
Step 2: Minimal Equivalent Edge Set (figure: paths P1 and P2, before and after reduction). The reachability between any two paths can be captured by a unique minimal set of edges. The edges in the minimal equivalent edge set do not cross (they are always parallel).
Step 3: Path-Graph Construction (figure: paths P1 to P4 and the resulting weighted directed path-graph). The weight reflects the cost we have to pay for the transitive closure computation if we exclude this path-tree edge.
Step 4: Extracting the Path-Tree Cover (figure: the weighted directed path-graph and its maximal directed spanning tree). The maximal directed spanning tree is found with the Chu-Liu/Edmonds algorithm, O(m’ + k log k).
Key Problems How to construct a path-tree? –Algorithm How can path-tree help with reachability queries? –Labeling –Transitive Closure Compression How does path-tree compare with the existing methods? –Optimality
3-Tuple Labeling for Reachability (figure: paths P1 to P4 with interval labels [1,1], [2,2], [1,3], [1,4]). DFS labeling (1-tuple). Interval labeling (2-tuple): a high-level description of the paths, answering whether Pi reaches Pj.
DFS Labeling (figure: paths P1 to P4): 1. Start from the first vertex in the root path. 2. Always try to visit the next vertex in the same path first. 3. Label a node once all its neighbors have been visited: L(v) = N - x, where x is the number of nodes that have already been labeled.
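The three rules above can be sketched as a post-order depth-first search. This is an illustration under assumptions, not the authors' implementation: N is taken to be the total number of vertices (the slide does not define it), neighbors other than the same-path successor are visited in adjacency-list order, and the names `dfs_labels`, `succ`, `next_in_path`, and `roots` are hypothetical.

```python
def dfs_labels(succ, next_in_path, roots):
    """Post-order DFS labels: the node finished after x others gets N - x."""
    nodes = set(succ) | {w for ws in succ.values() for w in ws}
    n = len(nodes)          # assumed meaning of N on the slide
    label = {}
    visited = set()

    def visit(v):
        visited.add(v)
        nxt = next_in_path.get(v)
        # Rule 2: try the next vertex on the same path before other neighbors.
        ordered = ([nxt] if nxt is not None else [])
        ordered += [w for w in succ.get(v, ()) if w != nxt]
        for w in ordered:
            if w not in visited:
                visit(w)
        # Rule 3: label v only after all its neighbors have been visited.
        label[v] = n - len(label)

    # Rule 1: start from the first vertex of the root path.
    for r in roots:
        if r not in visited:
            visit(r)
    return label

# Hypothetical toy input: path 1 -> 2 -> 3, path 4 -> 5, cross edges 1 -> 4, 5 -> 3.
succ = {1: [2, 4], 2: [3], 3: [], 4: [5], 5: [3]}
next_in_path = {1: 2, 2: 3, 4: 5}
dfs = dfs_labels(succ, next_in_path, roots=[1])
print(dfs)  # {3: 5, 2: 4, 5: 3, 4: 2, 1: 1}
```

Because a node is labeled only after everything it reaches, every edge goes from a smaller label to a larger one, which is what the DFS-label comparison in the reachability test relies on. The recursion is kept for brevity; a large graph would need an explicit stack.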
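3-Tuple Labeling for Reachability (figure: paths P1 to P4 with interval labels [1,1], [2,2], [1,3], [1,4]). u reaches v if and only if 1) the interval labels satisfy I(u) ⊇ I(v), and 2) the DFS labels satisfy L(u) ≤ L(v). Query(9,15): P4[1,4] ⊇ P1[1,1] and 5 < 15, so the answer is yes. Further queries on the slide: Query(9,2), Query(5,9).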
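A minimal sketch of the constant-time test follows, assuming each vertex stores the interval of its path together with its DFS label, and assuming the containment direction implied by the worked example (the source path's interval contains the target path's interval). The tuple layout and the name `reaches` are illustrative; the two labels are the values quoted for Query(9,15).

```python
def reaches(label_u, label_v):
    """3-tuple test: interval of u contains interval of v, and L(u) <= L(v)."""
    (lo_u, hi_u), dfs_u = label_u
    (lo_v, hi_v), dfs_v = label_v
    contains = lo_u <= lo_v and hi_v <= hi_u
    return contains and dfs_u <= dfs_v

# Values from the slide's example: vertex 9 lies on P4 with interval [1,4]
# and DFS label 5; vertex 15 lies on P1 with interval [1,1] and DFS label 15.
label_9 = ((1, 4), 5)
label_15 = ((1, 1), 15)
print(reaches(label_9, label_15))  # True
print(reaches(label_15, label_9))  # False
```

The test compares three integers per vertex, hence O(1). It covers reachability through the path-tree itself; edges left out of the path-tree are handled by the compressed transitive closure.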
Transitive Closure Compression: An efficient procedure computes and compresses the transitive closure in O(mk), where k is the number of paths in the path-tree. The path-tree cover (including labeling) can be constructed in O(m + n log n).
Key Problems How to construct a path-tree? –Algorithm How can a path-tree help with reachability queries? –Labeling –Transitive Closure Compression How does path-tree compare with the existing methods? –Optimality
Theoretical Analysis. Optimal Path-Tree Cover (OPTC) Problem: –Given a path-decomposition, what is the optimal path-tree cover to maximally compress the transitive closure? –OptIndex weight assignment based on computing the predecessor set. Optimal Path-Decomposition (OPD) Problem: –Assuming we only use the path-decomposition to compress the transitive closure, which path-decomposition compresses it maximally? –Minimal-cost flow problem –What is the overall optimal path-decomposition?
Superiority of Path-Tree Cover: The optimal tree cover is a special case of the path-tree cover in which each vertex corresponds to a single path and the weights are based on OptIndex. The path-tree cover approach can compress the transitive closure to a size no larger than that of the optimal tree cover approach (and consequently that of the optimal chain cover approach).
Experimental Evaluation: Implementation in C++. 12 real datasets, used in the Dual-Labeling and GRIPP papers. Synthetic datasets: –Sparse DAGs with edge density = 2. Platform: AMD Opteron 2.0GHz / 2GB / Linux. PTree1 (OptIndex) and PTree2: –Mainly compared with Optimal Tree Cover.
Real Datasets (table columns: Graph Name, #V, #E, DAG #V, DAG #E). Graphs: AgroCyc, aMaze, Anthra, Ecoo, HpyCyc, Human, Kegg, Mtbrv, Nasa, Reactome, Vchocyc, Xmark.
Experimental Result (Real Data). Table columns: Transitive Closure Size, Construction Time (in ms), Query Time (in ms), each reported for Tree, Ptree-1, and Ptree-2. Graphs: AgroCyc, aMaze, Anthra, Ecoo, HpyCyc, Human, Kegg, Mtbrv, Nasa, Reactome, Vchocyc, Xmark. Callouts: on average 10 times better than Tree; on average 3 times better than Tree.
Experimental Result (Synthetic Data)
Conclusion: A novel path-tree structure is proposed to help compress the transitive closure and answer reachability queries. The path-tree has the potential to be integrated with other existing methods to further improve the efficiency of reachability query processing.
Thanks!!
Step 3: Path-Graph Construction (figure: paths P1 to P4 and the weighted directed path-graph). The weight reflects the penalty if we exclude this path-tree edge.
Step 2: Constructing the Minimal Equivalent Edge Set (Pi to Pj; figure: paths P1 and P2): 1. Order the vertices in Pi and Pj in decreasing order. 2. Find the first vertex v in Pj that Pi can reach. 3. Find the last vertex u in Pi that reaches v. 4. Remove all the edges that cross (u,v), then repeat steps 2-4.
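The idea behind this step can be illustrated with a single sorted scan instead of the slide's literal four-step loop; this is a swapped-in formulation, offered as a sketch. It assumes each edge from Pi to Pj is given as a pair (a, b) of sequence positions, with a on Pi and b on Pj. An edge (a, b) is redundant when another edge leaves Pi at a position a' ≥ a and enters Pj at a position b' ≤ b, because a reaches a' along Pi and b' reaches b along Pj. The name `minimal_equivalent_edges` and the sample edges are hypothetical.

```python
def minimal_equivalent_edges(edges):
    """Keep only the non-dominated edges between two paths.

    edges: iterable of (a, b), where a is a position on Pi and b a position on Pj.
    Returns the kept edges sorted by increasing a (and therefore increasing b).
    """
    kept = []
    best_target = None  # smallest target position among edges kept so far
    # Scan from the last source position backwards; for equal sources,
    # look at the earliest target first.
    for a, b in sorted(edges, key=lambda e: (-e[0], e[1])):
        if best_target is None or b < best_target:
            kept.append((a, b))
            best_target = b
    kept.reverse()
    return kept

sample = [(1, 1), (1, 3), (2, 2), (3, 4), (2, 5)]  # hypothetical cross edges
print(minimal_equivalent_edges(sample))  # [(1, 1), (2, 2), (3, 4)]
```

In the result both coordinates increase together, so no two kept edges cross, matching the "always parallel" observation for the minimal equivalent edge set.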