iGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques. Jeffrey Xu Yu et al., VLDB '10. Presented by Tao Yu.

Presentation transcript:

iGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques
Jeffrey Xu Yu et al., VLDB '10
Presented by Tao Yu

Why I chose this paper
- Disk-based
- Implementation technique
- Graph database
- Application
- Dataset

Why they wrote this paper
- Provide a uniform test framework.
- Comparing the wall-clock times of binary executables is not fair.
- Some algorithms are implemented in memory while others are implemented on disk.
- Obtain real disk I/O counts by bypassing the OS disk cache.
- Perform a large number of tests.

Background
- Application
- Graph isomorphism
- Stream
- A large number of small graphs
- Undirected labeled graph G = (V, E, L_v, L_e)

Related work
- Mining-based approaches
- Non-mining-based approaches
- Size: number of edges

FG-Index
- Indexing
  - All frequent subgraphs
  - All infrequent edges
- Query
  - Enumerate a subset of subgraphs
  - Verification-free strategy

gIndex
- Indexing
  - All frequent subgraphs (maxL)
  - A subset of infrequent subgraphs (maxL)
  - Discriminative features
- Query
  - Enumerate all subgraphs (maxL)

SwiftIndex
- Indexing
  - All frequent trees of size up to maxL
  - All discriminative trees of size up to maxL
  - All infrequent edges
- Query
  - PrefixQuickSI

Tree+Δ
- Indexing
  - All frequent trees of size up to maxL - 1
  - All infrequent edges
  - Generates graph features on the fly
- Query
  - Enumerate all subtrees (maxL)

C-Tree
- Indexing
  - A hierarchical tree of graph closures
- Query
  - Pseudo subgraph isomorphism test

GraphGrep
- Indexing
  - All paths (maxL)
- Query
  - Enumerate all paths (maxL)
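The path features that GraphGrep indexes can be pictured with a small sketch. This is only an illustration of enumerating labeled paths up to maxL edges, not GraphGrep's actual implementation; the adjacency-dict representation and the name `enumerate_label_paths` are assumptions made for the example.

```python
from collections import Counter


def enumerate_label_paths(labels, adj, max_len):
    """Count the label sequences of all simple paths with at most max_len edges.

    labels: dict mapping vertex -> vertex label
    adj:    dict mapping vertex -> iterable of neighbouring vertices (undirected)

    Single vertices count as paths with zero edges, and every undirected path
    is walked once from each end, so it contributes two sequences.
    """
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:  # the path already has max_len edges
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # keep the path simple (no repeated vertex)
                extend(path + [nxt])

    for start in labels:
        extend([start])
    return counts


# A three-vertex chain C - O - C.
labels = {0: "C", 1: "O", 2: "C"}
adj = {0: [1], 1: [0, 2], 2: [1]}
paths = enumerate_label_paths(labels, adj, max_len=2)
```

The same enumeration would run over the query graph at query time, so that only database graphs containing every query path remain as candidates.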

gCode
- Indexing
  - Vertex signature from neighbors
  - Graph signature from vertex
  - GCode-Tree
- Query
  - Index level (graph signature)
  - Object level (vertex signature)

Isomorphism Algorithms
- VF2
- QuickSI
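VF2 and QuickSI are both refinements of backtracking search. As a rough illustration only, a plain backtracking test for labeled subgraph isomorphism might look like the following. It is neither VF2 nor QuickSI (it has none of their ordering or pruning rules), and the dict-based graph representation and the name `subgraph_isomorphic` are assumptions made for this sketch.

```python
_MISSING = object()  # sentinel so that a missing edge never equals a real label


def subgraph_isomorphic(q_labels, q_adj, g_labels, g_adj):
    """Return True if the query graph can be embedded in the data graph.

    q_labels / g_labels: dict mapping vertex -> vertex label
    q_adj / g_adj:       dict mapping vertex -> {neighbour: edge label},
                         stored symmetrically because the graphs are undirected

    Every query vertex must map to a distinct data vertex with the same label,
    and every query edge must map to a data edge with the same edge label.
    """
    order = list(q_labels)
    mapping = {}  # query vertex -> data vertex

    def consistent(qv, gv):
        if q_labels[qv] != g_labels[gv]:
            return False
        for qn, edge_label in q_adj.get(qv, {}).items():
            if qn in mapping:
                found = g_adj.get(gv, {}).get(mapping[qn], _MISSING)
                if found != edge_label:
                    return False
        return True

    def search(k):
        if k == len(order):
            return True
        qv = order[k]
        used = set(mapping.values())
        for gv in g_labels:
            if gv not in used and consistent(qv, gv):
                mapping[qv] = gv
                if search(k + 1):
                    return True
                del mapping[qv]  # undo and try the next data vertex
        return False

    return search(0)
```

In a filter-and-verify pipeline, a test of this kind would be run only on the candidate graphs that survive index filtering.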

Implementation
- Graph
  - A list of vertices and a list of edges
  - If a graph is smaller than the page size, store it as a tuple in a heap page
  - Otherwise, store it as a BLOB
  - B+-tree over all graphs, keyed by graph ID
- Other techniques
  - CAM code to encode features
  - Djb2 hash function
  - Mini-page
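Two of these implementation points are easy to sketch. The djb2 function below follows the commonly published definition (start at 5381, multiply by 33, add each byte), truncated here to 32 bits as an assumption. The transcript does not say what iGraph hashes with it. The storage rule is a simplification that ignores page headers, and the name `choose_storage` is hypothetical.

```python
PAGE_SIZE = 8 * 1024  # 8 KB pages, the page size given on the disk-schedule slide


def djb2(data: bytes) -> int:
    """djb2 string hash: h = h * 33 + byte, starting from 5381 (32-bit here)."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def choose_storage(serialized_graph: bytes, page_size: int = PAGE_SIZE) -> str:
    """Pick a storage form for a serialized graph.

    Graphs smaller than a page go into a heap page as a tuple;
    anything else is stored as a BLOB.
    """
    return "heap_tuple" if len(serialized_graph) < page_size else "blob"


small = choose_storage(b"v" * 200)       # fits in a page
large = choose_storage(b"v" * 20_000)    # larger than a page
```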

Dataset
- Small sparse
  - AIDS: graphs vertices and edges
  - 51 vertex labels and 4 edge labels
- Small dense
  - GraphGen: graphs 7 vertices and 30 edges
  - 20 vertex labels and 20 edge labels
- Large
  - PubChem: graphs vertices and edges
  - 81 vertex labels and 3 edge labels

Query sets
- For AIDS: the existing query sets Q4, Q8, ..., Q24 can be downloaded from [3]. Each query set Qn contains 1000 graphs, each of size n.
- For the other datasets:
  - First, randomly select 1000 graphs of size at least 24 from each dataset.
  - Then, for each graph g, remove edges, keeping g connected, until g contains 24 edges. This query set is called Q24.
  - To generate Q20, remove edges from each graph in Q24 until the remaining graph contains 20 edges.
  - Repeat this process to generate the remaining query sets.
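The edge-removal procedure for the non-AIDS query sets can be sketched as follows. The slide does not say how the edge to remove is chosen, so random choice is an assumption, as are the function names. "Connected" is interpreted here over the remaining edges, meaning a vertex that loses its last edge is dropped from the query.

```python
import random


def edges_connected(edges):
    """True if the edges form a single connected component."""
    edges = list(edges)
    if not edges:
        return True
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def shrink_to_size(edges, target, rng):
    """Remove random edges from a simple graph until `target` edges remain,
    never leaving the remaining edges disconnected."""
    edges = list(edges)
    while len(edges) > target:
        candidates = edges[:]
        rng.shuffle(candidates)
        for e in candidates:
            rest = [x for x in edges if x != e]
            if edges_connected(rest):
                edges = rest
                break
        else:
            raise ValueError("no edge can be removed without disconnecting the graph")
    return edges


# Hypothetical data graph: a 5 x 5 grid with 40 edges.
grid = [((r, c), (r, c + 1)) for r in range(5) for c in range(4)]
grid += [((r, c), (r + 1, c)) for r in range(4) for c in range(5)]
rng = random.Random(0)
q24 = shrink_to_size(grid, 24, rng)   # member of Q24
q20 = shrink_to_size(q24, 20, rng)    # derived member of Q20
```

Because each smaller query is derived from the larger one, every graph in Q20 is a subgraph of its counterpart in Q24.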

Disk schedule
- LRU as the buffer replacement algorithm
- Page size: 8 KB
- FILE_FLAG_NO_BUFFERING
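FILE_FLAG_NO_BUFFERING is the Windows CreateFile flag that opens a file without system caching, so every buffer-pool miss becomes a real disk read. The sketch below does not touch the disk. It only models an LRU buffer pool and counts misses as a stand-in for disk reads, and the class and function names are assumptions. It also reproduces the sequential flooding effect that comes up in the C-Tree results: repeatedly scanning more pages than the buffer holds makes every access a miss.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Buffer pool with least-recently-used replacement, counting page misses."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page_id -> page contents (omitted here)
        self.disk_reads = 0

    def fetch(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return
        self.disk_reads += 1                 # miss: the page must come from disk
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        self.pages[page_id] = None


def count_reads(capacity_pages, accesses):
    pool = LRUBufferPool(capacity_pages)
    for page_id in accesses:
        pool.fetch(page_id)
    return pool.disk_reads


two_scans = [0, 1, 2, 3] * 2            # scan a 4-page file twice
flooded = count_reads(3, two_scans)     # buffer one page too small: 8 reads
cached = count_reads(4, two_scans)      # buffer holds the whole file: 4 reads
```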

Experiment
- The database construction cost of gIndex is comparable to that of all feature selection methods such as Tree+Δ, FG-Index, and SwiftIndex.
- "We have communicated with Xifeng Yan who first ignored the edge label. He did that simply in order to 'make the problem more difficult.' Subsequent work imitated his setting without clear reason."

Experiment
- More features, fewer candidates.
- gIndex performs the best.

Experiment
- For Q4, FG-Index performs the best since it exploits the verification-free strategy.
- gCode performs the worst: 1) it has more candidates; 2) lookups over the vertex signature dictionary need more buffering.

gCode
- Indexing
  - Vertex signature from neighbors
  - Graph signature from vertex
  - GCode-Tree
- Query
  - Index level (graph signature)
  - Object level (vertex signature)

Experiment
- As for C-Tree, the number of disk I/Os is only slightly reduced compared with a small buffer size, since the database size of C-Tree is still larger than the buffer size, and tree traversal incurs the sequential flooding effect.

C-Tree
- Indexing
  - A hierarchical tree of graph closures
- Query
  - Pseudo subgraph isomorphism test

Experiment
- gIndex is slightly slower than FG-Index and SwiftIndex due to slow subgraph enumeration from a query.
- This fact indicates that the I/O cost must be carefully optimized to obtain good performance.

gIndex
- Indexing
  - All frequent subgraphs (maxL)
  - A subset of infrequent subgraphs (maxL)
  - Discriminative features
- Query
  - Enumerate all subgraphs (maxL)

Experiment
- Only 37 frequent features.
- Almost all features in FG-Index, Tree+Δ, and SwiftIndex are infrequent features.
- gCode uses signatures.
- gIndex mines all infrequent and discriminative features of size up to 3.

gIndex
- Indexing
  - All frequent subgraphs (maxL)
  - A subset of infrequent subgraphs (maxL)
  - Discriminative features
- Query
  - Enumerate all subgraphs (maxL)

gCode
- Indexing
  - Vertex signature from neighbors
  - Graph signature from vertex
  - GCode-Tree
- Query
  - Index level (graph signature)
  - Object level (vertex signature)

Experiment
- Drastic changes to gCode (I), C-Tree (I), and Tree+Δ.
- Frequent feature space is small.
- Graph features reclaimed at small sizes are used for larger query sizes.

Tree+Δ
- Indexing
  - All frequent trees of size up to maxL - 1
  - All infrequent edges
  - Generates graph features on the fly
- Query
  - Enumerate all subtrees (maxL)

Experiment
- FG-Index does not outperform gIndex even for Q4 since there exist no frequent features of size 4.
- Queries in this dense synthetic dataset contain many cycles, and thus the cost of mining graph features on the fly is very high.

Tree+Δ
- Indexing
  - All frequent trees of size up to maxL - 1
  - All infrequent edges
  - Generates graph features on the fly
- Query
  - Enumerate all subtrees (maxL)

Experiment
- The number of index features used by FG-Index or SwiftIndex is much smaller than that used by gIndex.
- This result indicates that having more features in the index does not by itself guarantee better performance.

Experiment
- The trends of all curves are consistent with those for the number of I/Os.
- gIndex shows the best performance in both cold and hot runs for a moderately dense dataset.

Experiment
- gCode performs the best for large query sizes with high density.
- gIndex performs comparatively better for a larger number of labels since its pruning is relatively more effective.

Results for Large Graph Database
- Since both SeqScan and C-Tree require prohibitive times to finish the experiments even with large buffer sizes, we exclude them from the large graph database experiments.
- As for gCode, we can run experiments with a 1 GByte buffer and a hot run; with buffer sizes smaller than 1 GByte and a cold run, we are unable to finish the experiments within a week.

Results for Large Graph Database
- FG-Index's pruning power is up to times lower than gIndex's, since FG-Index uses a strategy to select a subset of features in its index to minimize the filtering cost.

Results for Large Graph Database
- For Q4, FG-Index performs the best due to its verification-free strategy.
- For Q8 to Q12, gIndex performs the best since its pruning power is the best.
- For Q16 to Q24, either SwiftIndex or FG-Index performs the best since their posting list intersection costs are the least.
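The posting list intersection cost mentioned here comes from the filtering step of feature-based indexes: each indexed feature maps to the IDs of the graphs that contain it, and the candidate set for a query is the intersection of the lists of all features found in the query. The transcript does not describe how any of the indexes lay out their posting lists, so the sorted-list representation and the function names below are assumptions.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of graph IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def candidate_graphs(index, query_features):
    """Intersect the posting lists of all query features present in the index.

    Returns None when no query feature is indexed, meaning nothing can be
    pruned and every graph in the database stays a candidate.
    """
    lists = sorted((index[f] for f in query_features if f in index), key=len)
    if not lists:
        return None
    result = lists[0]  # start from the shortest list to keep intermediates small
    for plist in lists[1:]:
        result = intersect_sorted(result, plist)
        if not result:
            break
    return result


# Hypothetical index: feature -> sorted IDs of graphs containing it.
index = {"C-O": [1, 2, 3, 5], "C-N": [2, 3, 4], "C-C-O": [3, 5]}
candidates = candidate_graphs(index, ["C-O", "C-N", "C-C-O"])  # [3]
```

Larger queries contain more indexed features, so more and longer lists must be read and intersected, which is one way the I/O cost can grow with query size even when the final candidate set is small.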

Results for Large Graph Database
- Although gIndex performs worse than SwiftIndex and FG-Index in the number of I/Os for large query sizes, it performs the best for all query sizes except Q4 due to a good combination of the lowest number of candidates and low disk I/O costs.

Conclusion
- Overall winner: gIndex.
- For large queries on dense graphs, we recommend gCode.
- Source code: