GRAIL: Scalable Reachability Index for Large Graphs VLDB2010 Vineet Chaoji Mohammed J. Zaki.

Slides:



Advertisements
Similar presentations
Chapter 5: Tree Constructions
Advertisements

5.1 Real Vector Spaces.
Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014) Presenter: WEI, Hao.
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Lecture 24 MAS 714 Hartmut Klauck
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
Optimizing Join Enumeration in Transformation-based Query Optimizers ANIL SHANBHAG, S. SUDARSHAN IIT BOMBAY VLDB 2014
CSC 421: Algorithm Design & Analysis
Introduction to Algorithms Rabie A. Ramadan rabieramadan.org 2 Some of the sides are exported from different sources.
Fast Algorithms For Hierarchical Range Histogram Constructions
Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Microsoft Research.
Chapter 5 Decrease and Conquer. Homework 7 hw7 (due 3/17) hw7 (due 3/17) –page 127 question 5 –page 132 questions 5 and 6 –page 137 questions 5 and 6.
Graduate Center/City University of New York University of Helsinki FINDING OPTIMAL BAYESIAN NETWORK STRUCTURES WITH CONSTRAINTS LEARNED FROM DATA Xiannian.
Chapter 6: Transform and Conquer
Structural Inference of Hierarchies in Networks BY Yu Shuzhi 27, Mar 2014.
Fast FAST By Noga Alon, Daniel Lokshtanov And Saket Saurabh Presentation by Gil Einziger.
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 21: Graphs.
Complexity 15-1 Complexity Andrei Bulatov Hierarchy Theorem.
Implementation of Graph Decomposition and Recursive Closures Graph Decomposition and Recursive Closures was published in 2003 by Professor Chen. The project.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
Code and Decoder Design of LDPC Codes for Gbps Systems Jeremy Thorpe Presented to: Microsoft Research
1 Branch and Bound Searching Strategies 2 Branch-and-bound strategy 2 mechanisms: A mechanism to generate branches A mechanism to generate a bound so.
Efficiently Answering Reachability Queries on Large Directed Graphs Ruoming Jin Kent State University Joint work with Yang Xiang (KSU), Ning Ruan (KSU),
Complexity 1 Mazes And Random Walks. Complexity 2 Can You Solve This Maze?
Reachability Queries Sept. 2014Yangjun Chen ACS Outline: Reachability Query Evaluation What is reachability query? Reachability query evaluation.
Data Structures, Spring 2006 © L. Joskowicz 1 Data Structures – LECTURE 14 Strongly connected components Definition and motivation Algorithm Chapter 22.5.
Recursive Graph Deduction and Reachability Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba,
The Analysis of Variance
Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.
1 Chapter 8 The Discrete Fourier Transform 2 Introduction  In Chapters 2 and 3 we discussed the representation of sequences and LTI systems in terms.
MA/CSSE 473 Day 12 Insertion Sort quick review DFS, BFS Topological Sort.
Network Aware Resource Allocation in Distributed Clouds.
All that remains is to connect the edges in the variable-setters to the appropriate clause-checkers in the way that we require. This is done by the convey.
Computer Science 112 Fundamentals of Programming II Introduction to Graphs.
Chapter 2 Graph Algorithms.
“On an Algorithm of Zemlyachenko for Subtree Isomorphism” Yefim Dinitz, Alon Itai, Michael Rodeh (1998) Presented by: Masha Igra, Merav Bukra.
Diversified Top-k Graph Pattern Matching 1 Yinghui Wu UC Santa Barbara Wenfei Fan University of Edinburgh Southwest Jiaotong University Xin Wang.
On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.
MA/CSSE 473 Day 15 BFS Topological Sort Combinatorial Object Generation Intro.
Discussion #32 1/13 Discussion #32 Properties and Applications of Depth-First Search Trees.
Fast and practical indexing and querying of very large graphs Silke Triβl, Ulf Leser Humboldt-Universitat zu Berlin Presenter: Liwen Sun (Stephen) SIGMOD’07.
ITEC 2620A Introduction to Data Structures Instructor: Prof. Z. Yang Course Website: 2620a.htm Office: TEL 3049.
Graph Colouring L09: Oct 10. This Lecture Graph coloring is another important problem in graph theory. It also has many applications, including the famous.
Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
Spectral Sequencing Based on Graph Distance Rong Liu, Hao Zhang, Oliver van Kaick {lrong, haoz, cs.sfu.ca {lrong, haoz, cs.sfu.ca.
Week 10 - Friday.  What did we talk about last time?  Graph representations  Adjacency matrix  Adjacency lists  Depth first search.
Comparison of Tarry’s Algorithm and Awerbuch’s Algorithm CS 6/73201 Advanced Operating System Presentation by: Sanjitkumar Patel.
A dynamic algorithm for topologically sorting directed acyclic graphs David J. Pearce and Paul H.J. Kelly Imperial College, London, UK
Properties and Applications of Depth-First Search Trees and Forests
APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In.
An Exact Algorithm for Difficult Detailed Routing Problems Kolja Sulimma Wolfgang Kunz J. W.-Goethe Universität Frankfurt.
Branch and Bound Searching Strategies
 2004 SDU 1 Lecture5-Strongly Connected Components.
Common Intersection of Half-Planes in R 2 2 PROBLEM (Common Intersection of half- planes in R 2 ) Given n half-planes H 1, H 2,..., H n in R 2 compute.
Incrementally Improving Lookup Latency in Distributed Hash Table Systems Hui Zhang 1, Ashish Goel 2, Ramesh Govindan 1 1 University of Southern California.
CSC317 1 At the same time: Breadth-first search tree: If node v is discovered after u then edge uv is added to the tree. We say that u is a predecessor.
Breadth-First Search (BFS)
Main algorithm with recursion: We’ll have a function DFS that initializes, and then calls DFS-Visit, which is a recursive function and does the depth first.
Ch9: Decision Trees 9.1 Introduction A decision tree:
CS 3343: Analysis of Algorithms
Spatial Online Sampling and Aggregation
Effective Social Network Quarantine with Minimal Isolation Costs
Fast Nearest Neighbor Search on Road Networks
Lectures on Graph Algorithms: searching, testing and sorting
ITEC 2620M Introduction to Data Structures
Trevor Brown DC 2338, Office hour M3-4pm
Presentation transcript:

GRAIL: Scalable Reachability Index for Large Graphs VLDB2010 Vineet Chaoji Mohammed J. Zaki

1. INTRODUCTION  Problem Definition  Given a directed graph G = (V,E) and two nodes u, v ∈ V, a reach ability query asks if there exists a path from u to v in G. If u can reach v, we denote it as u → v  Application  Semantic Web is composed of RDF/OWL data and Reach ability queries to infer the relationships among the objects.  network biology, reachability querying protein-protein interaction networks

DAG 1  the problem of reachability on directed graphs can be reduced to reachability on directed acyclic graphs (DAGs)  equivalent DAG G′  each node represents a strongly connected component of original graph, each edge represents whether one component can reach another.

DAG 2  answer whether node u can reach v in G  look up their corresponding strongly connected components, Su and Sv, respectively, which are the nodes in G′. If Su = Sv, then u and reach v, If Su != Sv, then we pose the question whether Su can reach Sv in G′.  reachability queries on the original graph can be answered on the DAG, and thus we will discuss methods for reachability only on DAGs.

Tradeoff between Query Time and Index Size

Basic Approach: interval labeling  reachability problem on trees can be solved effectively by interval labeling ; linear time and space for constructing the index, and provides constant time querying.  Definition  each node u with a range Lu = [rx, ru]  ru:denotes the rank of the node u in a post-order traversal of the tree, ranks begin at 1, and all the children of a node ordered and fixed for that traversal(reversed visiting order compared with rx)

 rx: DFS from root, number of nodes that has been popped + 1 currently

Theorm 0  For trees, u → v ⇐⇒ Lv ⊆ Lu.  Proof:  Consider one step reachability,  if u → v in one step, then for rx: u will be visited before v, so rx(u) = ru(v), clearly Lv ⊆ Lu.  If Lv ⊆ Lu, then rx(u)<=rx(v), rx(u)<=rx(v) means u is visited before v in DFS process, and consider constraint of tree, so u → v in one step  Now because relationship u → v, Lv ⊆ Lu is equivalent relationship, and path_num(u,v) is at most 1, so u → v ⇐⇒ Lv ⊆ Lu(1 step) can deduce u → v ⇐⇒ Lv ⊆ Lu(n step)

generalize the interval labeling to a DAG  ensure that a node is not visited more than once, and a node will keep the post-order rank of its first visit.  interval containment of nodes in a DAG is not exactly equivalent to reachability.  For example, 5 !→ 4, but L4 = [1, 5] ⊆ [1, 8] = L5.

3. THE GRAIL APPROACH  The key idea is to do fast elimination for pairs of query nodes for whom non- reachability can be determined via the intervals. random  employs multiple intervals that are obtained via random graph traversals. symbol d denote the number of intervals to keep per node

THE GRAIL APPROACH  Definition

THEOREM 1

two main issues in GRAIL  i) how to compute the d random interval labels while indexing  ii) how to deal with exceptions, while querying

3.1 Index Construction  The best strategy is to cease labeling after a small number of dimensions (such as 5), with reduced exceptions, rather than trying to totally eliminate all exceptions

Traversal Strategies  Randomized: shown in Algorithm 1, with a random traversal order for each dimension.  Randomized Pairs: first randomize the order of the roots and children, and fix it. We then generate pairs of labeling, using left-to-right (L- R) and right-to-left (R-L) traversals.  Bottom Up: this strategy we conceptually “reverse the edges” and process the nodes in reverse topological order. The bottom-up traversal can be done at random, or in random- pairs.

3.2 Reachability Queries  Exception’s Definition  For example, for the DAG in Figure 2(b), we can see that E2 = {1, 4}, E4 = {3, 7, 9},  keeping explicit exception lists per node does not scale to very large graphs. Default approach in GRAIL is to use a “smart” DFS, with recursive containment check based pruning, to answer queries.

Computational Complexity  querying takes O(d) time if Lv ! ⊆ Lu  exception lists are to be used, and they are maintained in a hash table, then the check in Line 3 takes O(1) time; otherwise, if the exceptions list is kept sorted, then the times is O(log(|Eu|)).  default option is to perform DFS, but note that it is possible we may terminate early due to the containment based pruning. Thus the worst case complexity is O(n + m) for the DFS

4.1 Datasets  Small-Sparse: These are small, real graphs, with average degree less than 1.2,  Small-Dense: These are small, dense real- world graphs  Large-Real: To evaluate the scalability of GRAIL on real datasets, we collected 7 new datasets which have previously not been been used by existing methods  Large-Synthetic: To test the scalability with different density setting, we generated random DAGs, ranging with 10M and 100M nodes, with average degrees of 2, 5, and 10.

4.2 Small Real Datasets: Sparse and Dense  Tables 7, 8, and 9 show the index construction time, query time, and index size for the small, sparse, real datasets.  Tables 10, 11, and 12 give the corresponding values for the small, dense, real datasets.  The last column in Tables 8 and 11 shows the number of reachable query node-pairs out of the 100K test queries; the query node-pairs are sampled randomly from the graphs

4.2 Small Real Datasets: Sparse and Dense  Conclusion  On the sparse datasets, GRAIL (using d = 2 traversals) has the smallest construction time  In terms of query time, PathTree is the best; it is times faster than GRAIL  DFS gives reasonable query performance  On the small dense datasets, GRAIL (with d = 2) has the smallest construction times and index size.  indexing does deliver significant benefits in terms of query time performance.

4.3 Large Datasets: Real and Synthetic  Table 13 shows the construction time, query time, and index size for GRAIL (d = 5) and pure DFS, on the large real datasets.  We also ran PathTree, on cit-patents and citeseerx is aborted with a memory limit error (–(m)), whereas for the other datasets it exceeded the 20M ms time limit (– (t)).  Table 14 shows the construction time, query time and index sizes for GRAIL and DFS on the large synthetic graph  GRAIL is uniformly better than DFS in all cases.  we conclude that GRAIL is the most scalable reachability index for large graphs, especially with increasing density

4.4 GRAIL: Sensitivity  Table 15 shows the effect of using exception lists in GRAIL  Using exceptions does help in some cases, but increase construction time, and the large size of exception lists, do not justify the small gains. exceptions could not be constructed on the large real graphs.

Number of Traversals/Intervals (d)

 increasing the number of intervals increases construction time, but yields decreasing query times.  at some point the overhead of checking a larger number of intervals negates the potential reduction in exceptions. (query time increases from d = 4 to d = 5 for ecoo)

trade-off in query cost v.s. index size  To estimate the number of traversals that minimize the query time, or that optimize the index size/query time trade-off is not straightforward.  However, for any practical benefits it is imperative to keep the index size smaller than the graph size. This loose constraint restricts d to be less than the average degree.

Effect of Reachability  For all of the experiments above, we issue 100K random query pairs. However, since the graphs are very sparse, the vast majority of these pairs are not reachable  As an alternative, we generated 100K reachable pairs by simulating a random walk  Tables 16 and 17 show the query time performance of GRAIL and pure DFS for the 100K random and 100K only positive queries, on some small and large graphs

Observation  Generally speaking, querying only reachable pairs takes longer (from 2-30 times) for both GRAIL and DFS. Note also that GRAIL is 2-4 times faster than DFS on positive queries, and up to 30 times faster on the random ones.

Effect of Query Distribution  for query sets with all reachable (positive) node-pairs, CV (coefficient of variation) decreases for GRAIL since the likelihood of pruning and early termination of the query decreases.

Effect of Density  generating random DAGs with 10 million nodes, and varying the average density from 2 to 10, as shown in Figure 5.  both the construction and query time increase with increasing density.  GRAIL (with d =5) is an order of magnitude faster than pure DFS in query time.

Question Remained  The question whether there exists an interval labeling with d dimensions that has no exceptions, is likely to be NP-complete.  My Opinion:  Algorithm 1 is based on DFS coding vertex  Each coding instance could be mapped to a DFS tree of graph G  So, d should as large as Number of trees for G to guarantee no exceptions  Number of trees for G is exponential for nodes number in G  That problem is not NPC