Graph Indexing (from Managing and Mining Graph Data)

Introduction Advanced database systems face a great challenge arising from the emergence of massive, complex structural data in bioinformatics, cheminformatics, business processes, etc. One of the most important functions needed in these areas is efficient search of complex graph data. Given a graph query, it is desirable to retrieve relevant graphs quickly from a large database via efficient graph indices.

Feature-Based Graph Index Definition: Given a graph database D = {G_1, G_2, ..., G_n} and a query graph Q, substructure search is to find all the graphs in D that contain Q. Feature-based graph indexing is designed to answer substructure search queries and consists of two major steps: (1) index construction and (2) query processing.

Step 1: Index Construction Definition: index construction precomputes features from a graph database and builds indices based on these features. Various kinds of features can be used, including node/edge labels, paths, trees, and subgraphs. Let F be a feature set for a given graph database D. For any feature f ∈ F, D_f is the set of graphs containing f: D_f = { G ∣ f ⊆ G, G ∈ D }. Meaning: the set of database graphs containing the feature "vertex label = 'x'" consists of every graph G such that this vertex-label feature is a substructure of G and G is a member of the graph database.
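The index construction step can be sketched as an inverted index from each feature to the ids of the graphs that contain it. This is a minimal illustration, not the method of any particular system: the graph encoding (a dict of vertex labels plus an edge list), the two feature kinds (vertex labels and edge label pairs), and the names `extract_features` and `build_index` are all assumptions made for the example.

```python
from collections import defaultdict

def extract_features(labels, edges):
    """Return simple features of one graph: vertex labels and edge label pairs."""
    feats = {("v", lab) for lab in labels.values()}
    for u, v in edges:
        # Sort the two end labels so an undirected edge has one canonical form.
        feats.add(("e",) + tuple(sorted((labels[u], labels[v]))))
    return feats

def build_index(database):
    """Map each feature to the set of ids of the graphs that contain it."""
    index = defaultdict(set)
    for gid, (labels, edges) in database.items():
        for f in extract_features(labels, edges):
            index[f].add(gid)
    return dict(index)

# Hypothetical toy database: graph id -> (vertex labels, undirected edges).
database = {
    "G1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    "G2": ({1: "A", 2: "B"}, [(1, 2)]),
    "G3": ({1: "B", 2: "C"}, [(1, 2)]),
}
index = build_index(database)
print(sorted(index[("v", "A")]))       # graphs containing a vertex labeled A
print(sorted(index[("e", "B", "C")]))  # graphs containing a B-C edge
```

Each value of `index` plays the role of the set of graphs containing one feature; richer features such as paths, trees, or subgraphs would only change `extract_features`.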

Step 2: Query processing (1) Search, which enumerates all the features in a query graph Q to compute the candidate query answer set, C_Q = ∩_f D_f (f ⊆ Q and f ∈ F); each graph in C_Q contains all of Q's features. Therefore, the actual answer set D_Q is a subset of C_Q. (2) Fetching, which retrieves the graphs in the candidate answer set from disk. (3) Verification, which checks each graph in the candidate answer set to verify whether it really satisfies the query.
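The three steps (search, fetching, verification) can be sketched as follows. This is an illustrative toy under stated assumptions: the index holds only vertex-label features, a dictionary lookup stands in for the disk fetch, and verification uses a brute-force subgraph isomorphism test that is only practical for very small graphs. The names `is_subgraph` and `answer_query` are hypothetical.

```python
from itertools import permutations

def is_subgraph(q_labels, q_edges, g_labels, g_edges):
    """Brute-force test: does the labeled graph g contain the labeled graph q?"""
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))  # injective vertex mapping q -> g
        if all(q_labels[u] == g_labels[m[u]] for u in q_nodes) and all(
            frozenset((m[u], m[v])) in g_edge_set for u, v in q_edges
        ):
            return True
    return False

def answer_query(query, query_features, index, database):
    """Return (candidate set, verified answer set) for a substructure query."""
    # (1) Search: intersect the graph sets of all indexed features of the query.
    candidates = set(database)
    for f in query_features:
        if f in index:
            candidates &= index[f]
    # (2) Fetching: a dict lookup stands in for reading each graph from disk.
    # (3) Verification: keep only candidates that really contain the query.
    answers = {gid for gid in candidates if is_subgraph(*query, *database[gid])}
    return candidates, answers

# Hypothetical data: graph id -> (vertex labels, undirected edges).
database = {
    "G1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),  # path A-B-C
    "G2": ({1: "A", 2: "C", 3: "B"}, [(1, 2), (2, 3)]),  # path A-C-B
    "G3": ({1: "B", 2: "C"}, [(1, 2)]),                  # edge B-C
}
index = {"A": {"G1", "G2"}, "B": {"G1", "G2", "G3"}, "C": {"G1", "G2", "G3"}}
query = ({1: "A", 2: "B"}, [(1, 2)])                     # edge A-B

candidates, answers = answer_query(query, ["A", "B"], index, database)
print(sorted(candidates), sorted(answers))
```

Here G2 survives the filter because it has vertices labeled A and B, but it has no A-B edge, so verification rejects it. It is a false positive of exactly the kind that a more selective feature set is meant to avoid.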

Step 2: Query processing The query response time of the above search framework is formulated as T_search + |C_Q| × (T_io + T_iso_test), where T_search is the time spent in the search step, T_io is the average I/O time of fetching a candidate graph from the disk, and T_iso_test is the average time of checking a subgraph isomorphism, which is conducted over the query Q and the graphs in the candidate answer set. The candidate graphs are usually scattered around the entire disk. Thus, T_io is the I/O time of fetching a block on a disk (assuming a graph can be accommodated in one disk block). The value of T_iso_test does not change much for a given query. Therefore, the key to improving the query response time is to minimize the size of the candidate answer set, |C_Q|, as much as possible. When a database is so large that the index cannot be held in main memory, T_search will also affect the query response time.
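A small worked example of this cost model, reading the response time as the search time plus, for every candidate graph, one fetch and one isomorphism test. All timing values below are made-up illustrative numbers, and the function name is hypothetical.

```python
def query_response_time(t_search, num_candidates, t_io, t_iso_test):
    """Search time plus, per candidate graph, one fetch and one isomorphism test."""
    return t_search + num_candidates * (t_io + t_iso_test)

# Hypothetical timings in seconds.
t_search, t_io, t_iso_test = 0.05, 0.010, 0.002

weak_filter = query_response_time(t_search, 1000, t_io, t_iso_test)
strong_filter = query_response_time(t_search, 100, t_io, t_iso_test)
print(f"1000 candidates: {weak_filter:.2f} s, 100 candidates: {strong_filter:.2f} s")
```

With these numbers the per-candidate term dominates, so shrinking the candidate set by a factor of ten cuts the response time by almost the same factor, which is why the filtering power of the index matters most.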

Caution !!! Since all the features in the index contained by a query are enumerated, it is important to maintain a compact feature set in the memory. Otherwise, the cost of accessing the index may be even greater than that of accessing the database itself.

Paths as features to index graphs One solution to substructure search is to take paths as features to index graphs: enumerate all the existing paths in a database up to a length maxL and use them as features to index, where a path is a vertex sequence v_1, v_2, ..., v_k such that, ∀ 1 ≤ i ≤ k − 1, (v_i, v_{i+1}) is an edge. The index is then used to identify the graphs that contain all the paths (up to the maxL length) in the query graph.
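Path enumeration can be sketched with a depth-first search. The sketch makes several assumptions that the slides do not state: path length is counted in edges, paths are simple (no repeated vertex), a single vertex counts as a path of length 0, and a path and its reverse are treated as the same feature. The name `label_paths` is hypothetical.

```python
def label_paths(labels, edges, max_len):
    """Return label sequences of all simple paths with at most max_len edges."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    found = set()

    def extend(path):
        seq = tuple(labels[v] for v in path)
        # Undirected graph: store one canonical direction per path.
        found.add(min(seq, seq[::-1]))
        if len(path) - 1 < max_len:
            for w in adj[path[-1]]:
                if w not in path:
                    extend(path + [w])

    for v in labels:
        extend([v])
    return found

# Hypothetical graph: the path A-B-C.
labels = {1: "A", 2: "B", 3: "C"}
edges = [(1, 2), (2, 3)]
for p in sorted(label_paths(labels, edges, 2), key=lambda s: (len(s), s)):
    print("-".join(p))
```

The same function applied to a query graph yields the paths that every answer graph must contain, so the candidate set is the intersection of the graph lists of those paths.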

Example of a path: v_2, e_7, v_5, e_6, v_4, e_3, v_3

GraphGrep (Shasha et al., 2002)

Advantages of path indexing (1) Paths are easier to manipulate than trees and graphs. (2) The index space is predefined: all the paths up to the maxL length are selected.

Disadvantages of path indexing In order to answer tree- or graph-structured queries, a path-based approach has to break query graphs into paths, search each path separately for the graphs containing the path, and join the results. Since structural information can be lost when query graphs are decomposed into paths, many false positive candidates are likely to be returned. In addition, a graph database may contain millions of different paths if it is large and diverse. These disadvantages motivate the search for new indexing features.

Frequent Structures A straightforward way of extending paths is to involve more complicated features, e.g., all the substructures extracted from a graph database. Unfortunately, the number of substructures can be even larger than the number of paths, leading to an exponentially large index in practice. One solution is to set a threshold on the frequency of substructures and index only the frequent ones.

Frequent Structures Definition 5.2 (Frequent Structures). Given a graph database D = {G_1, G_2, ..., G_n} and a graph structure f, the support of f is defined as sup(f) = |D_f|, where D_f is referred to as f's supporting graphs. With a predefined threshold min_sup, f is said to be frequent if sup(f) ≥ min_sup. Should a uniform min_sup be enforced for all the frequent structures? In order to reduce the overall index size, it is appropriate to have a low minimum support on small structures (for effectiveness) and a high minimum support on large structures (for compactness).
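The difference between a uniform threshold and a size-increasing one can be sketched as below. The linear threshold function, its parameters `max_size` and `theta`, and the toy supporting sets are assumptions chosen for illustration; the slides only say that the minimum support should grow with structure size.

```python
import math

def size_increasing_min_sup(size, max_size, theta):
    """Hypothetical threshold that grows linearly from about theta/max_size to theta."""
    return max(1, math.ceil(theta * size / max_size))

def frequent_features(supporting_graphs, sizes, min_sup):
    """Keep the features whose support meets the threshold for their size."""
    return {
        f for f, graphs in supporting_graphs.items()
        if len(graphs) >= min_sup(sizes[f])
    }

# Hypothetical features with their supporting graphs and sizes (in edges).
supporting_graphs = {
    "A-B": {"G1", "G2", "G3", "G4", "G5"},
    "A-B-C": {"G1", "G2", "G3"},
    "A-B-C-D": {"G1", "G2"},
}
sizes = {"A-B": 1, "A-B-C": 2, "A-B-C-D": 3}

uniform = frequent_features(supporting_graphs, sizes, lambda size: 2)
increasing = frequent_features(
    supporting_graphs, sizes, lambda size: size_increasing_min_sup(size, 3, 6)
)
print(sorted(uniform), sorted(increasing))
```

With the uniform threshold all three features are indexed. With the size-increasing one (thresholds 2, 4, and 6 for sizes 1, 2, and 3) only the small, widely shared feature remains, which keeps the index compact.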

Discriminative Structures Example: ABA = 3, ABAB = 3, ABAC = 3, ABCA = 3, ABC = 3. Which structures should we index? Among similar structures with the same support, it is often sufficient to index only the smallest common substructures, since more query graphs may contain these structures (higher coverage).
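One way to read the example is sketched below, under loud assumptions: the five structures are treated as label paths written as strings, the numbers are treated as their supports, and containment is approximated by a substring test in either direction. The selection rule (skip a structure when an already selected substructure has the same support) is a simplification for illustration and not necessarily the exact criterion used by any published index. Note that a structure's supporting graphs are always a subset of those of any substructure it contains, so equal support counts imply identical supporting sets.

```python
def contains(big, small):
    """Assumed containment test for label paths: substring in either direction."""
    return small in big or small[::-1] in big

def select_discriminative(supports):
    """Keep a structure only if no selected substructure has the same support."""
    selected = []
    # Visit smaller structures first so that they get priority.
    for f in sorted(supports, key=lambda s: (len(s), s)):
        redundant = any(
            contains(f, g) and supports[g] == supports[f] for g in selected
        )
        if not redundant:
            selected.append(f)
    return selected

supports = {"ABA": 3, "ABAB": 3, "ABAC": 3, "ABCA": 3, "ABC": 3}
chosen = select_discriminative(supports)
print(chosen)
```

Under these assumptions only ABA and ABC are kept: ABAB and ABAC add nothing over ABA, and ABCA adds nothing over ABC, because each has exactly the same supporting graphs as the smaller structure it contains.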

Trees A tree, also called a free tree, is a connected, acyclic, undirected graph.
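The definition can be checked directly: an undirected graph on n vertices is a free tree exactly when it is connected and has n − 1 edges. A minimal sketch, with a hypothetical function name and a plain vertex list plus edge list as the assumed encoding:

```python
from collections import deque

def is_free_tree(vertices, edges):
    """True if the undirected graph is connected and acyclic."""
    n = len(vertices)
    if n == 0 or len(edges) != n - 1:
        return False
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # With exactly n - 1 edges, connectivity implies that there is no cycle.
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n

print(is_free_tree([1, 2, 3], [(1, 2), (2, 3)]))          # a path
print(is_free_tree([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))  # a triangle
```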

C-Tree

Subgraph queries vs supergraph queries [Figure: a graph database answering a subgraph query and a supergraph query, each producing a query result]

Subgraph queries vs supergraph queries A subgraph query searches for a specific pattern in the graph database. Given a graph database D = {g_1, g_2, g_3, …, g_n} and a subgraph query q, the query answer set is A = { g_i ∣ q ⊆ g_i, g_i ∈ D }. A supergraph query searches for the members of the graph database whose whole structures are contained in the input query. Given a graph database D = {g_1, g_2, g_3, …, g_n} and a supergraph query q, the query answer set is A = { g_i ∣ g_i ⊆ q, g_i ∈ D }.
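The two answer sets differ only in the direction of the containment test, which the sketch below makes explicit. It is a toy under stated assumptions: graphs are (vertex labels, edge list) pairs, containment is checked by brute force and is only feasible for very small graphs, and the database and queries are invented for the example rather than taken from the slides' figure.

```python
from itertools import permutations

def is_subgraph(small, big):
    """Brute-force test: is the labeled graph `small` contained in `big`?"""
    s_labels, s_edges = small
    b_labels, b_edges = big
    b_edge_set = {frozenset(e) for e in b_edges}
    s_nodes = list(s_labels)
    for image in permutations(b_labels, len(s_nodes)):
        m = dict(zip(s_nodes, image))  # injective vertex mapping small -> big
        if all(s_labels[u] == b_labels[m[u]] for u in s_nodes) and all(
            frozenset((m[u], m[v])) in b_edge_set for u, v in s_edges
        ):
            return True
    return False

def subgraph_query(q, db):
    """Graphs in the database that contain the query."""
    return {gid for gid, g in db.items() if is_subgraph(q, g)}

def supergraph_query(q, db):
    """Graphs in the database that are contained in the query."""
    return {gid for gid, g in db.items() if is_subgraph(g, q)}

# Hypothetical database and queries.
db = {
    "g1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),  # path A-B-C
    "g2": ({1: "B", 2: "C"}, [(1, 2)]),                  # edge B-C
    "g3": ({1: "B", 2: "D"}, [(1, 2)]),                  # edge B-D
}
q1 = ({1: "B", 2: "C"}, [(1, 2)])                                      # edge B-C
q2 = ({1: "A", 2: "B", 3: "C", 4: "D"}, [(1, 2), (2, 3), (3, 4)])      # path A-B-C-D

print(sorted(subgraph_query(q1, db)), sorted(supergraph_query(q1, db)))
print(sorted(subgraph_query(q2, db)), sorted(supergraph_query(q2, db)))
```

For the small query q1 the subgraph answer is the larger set; for the large query q2 no database graph contains it, but two database graphs fit inside it. g3 is excluded from the supergraph answer of q2 because q2 has no B-D edge.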

[Figure: database graphs g1, g2, g3 and query graphs q1, q2, with vertices labeled from A, B, C, D, E, F, X] What is the answer set?

Supergraph Query Processing The supergraph query processing problem is important but has not been extensively considered in the research literature. Only two approaches have been presented to deal with this problem: cIndex by Chen et al. (2007) and GPTree by Zhang et al. (2009).

Conclusions (1) Graph indexing is one of the emerging important tasks in graph database management and graph data mining. It is fundamental to many graph-related applications, especially when an application involves large-scale graph databases. We have learned about substructure search, approximate substructure search, and feature-based graph indexing methods that mine and index a compact set of discriminative and selective structure features for fast graph retrieval. These methods are expected to significantly improve the performance of advanced graph applications such as graph classification and clustering. There is a clear imbalance between the number of techniques developed for processing supergraph queries and the number developed for other types of graph queries. The reason is that the supergraph query type is relatively new. Therefore, many technical aspects still remain unexplored.

Thank you…