Graph Indexing (from Managing and Mining Graph Data)
Introduction Advanced database systems face a great challenge arising from the emergence of massive, complex structural data in bioinformatics, cheminformatics, business processes, etc. One of the most important functions needed in these areas is efficient search of complex graph data. Given a graph query, it is desirable to retrieve relevant graphs quickly from a large database via efficient graph indices.
Feature-Based Graph Index Definition: Given a graph database $D = \{G_1, G_2, \ldots, G_n\}$ and a query graph $Q$, substructure search is to find all the graphs in $D$ that contain $Q$. Feature-based graph indexing is designed to answer substructure search queries and consists of two major steps: (1) index construction and (2) query processing.
Step 1: Index Construction Definition: index construction precomputes features from a graph database and builds indices based on these features. Various kinds of features can be used, including node/edge labels, paths, trees, and subgraphs. Let $F$ be a feature set for a given graph database $D$. For any feature $f \in F$, $D_f$ is the set of graphs containing $f$: $D_f = \{G \mid f \subseteq G,\ G \in D\}$. MEANING: the set of database graphs that contain, say, the feature "vertex label = 'x'" consists of every graph $G$ such that this vertex-label feature is a subgraph of $G$ and $G$ is a member of the graph database.
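The index construction step can be sketched as an inverted index from each feature to the set of graphs containing it. This is a minimal sketch, not the method of any particular system: the graph representation (a pair of a vertex-label dict and an edge list), the choice of vertex labels and labeled edges as features, and the names `simple_features` and `build_index` are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed representation: a graph is (vertex_labels, edges), where
# vertex_labels maps a vertex id to its label and edges is a list of (u, v).

def simple_features(graph):
    """Return the vertex-label and labeled-edge features of one graph."""
    labels, edges = graph
    feats = set(labels.values())  # vertex labels, e.g. "A"
    for u, v in edges:
        # An undirected labeled edge, stored with its labels sorted.
        feats.add(tuple(sorted((labels[u], labels[v]))))
    return feats

def build_index(database):
    """Map every feature f to D_f, the ids of the graphs that contain f."""
    index = defaultdict(set)
    for gid, graph in database.items():
        for f in simple_features(graph):
            index[f].add(gid)
    return dict(index)

database = {
    "G1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    "G2": ({1: "A", 2: "B"}, [(1, 2)]),
    "G3": ({1: "B", 2: "C"}, [(1, 2)]),
}
index = build_index(database)
print(index[("A", "B")])  # the supporting graphs of the labeled edge A-B
```

Here `index[f]` plays the role of $D_f$; richer features (paths, trees, subgraphs) would change only `simple_features`.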
Step 2: Query processing (1) Search, which enumerates all the features in a query graph $Q$ to compute the candidate answer set $C_Q = \bigcap_f D_f$ (over all $f \subseteq Q$ with $f \in F$); each graph in $C_Q$ contains all of $Q$'s features. Therefore, the true answer set $D_Q$ is a subset of $C_Q$. (2) Fetching, which retrieves the graphs in the candidate answer set from disk. (3) Verification, which checks the graphs in the candidate answer set to verify that they really satisfy the query.
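The search and verification steps can be sketched as follows. This is a minimal sketch under stated assumptions: graphs are (vertex_labels, edges) pairs, features are vertex labels and labeled edges, and verification uses a brute-force subgraph isomorphism test that is only practical for tiny graphs. The names `features`, `contains`, and `answer` are hypothetical.

```python
from itertools import permutations

def features(graph):
    """Vertex-label and labeled-edge features (an assumed feature set)."""
    labels, edges = graph
    feats = set(labels.values())
    feats.update(tuple(sorted((labels[u], labels[v]))) for u, v in edges)
    return feats

def contains(graph, query):
    """Brute-force test: is `query` subgraph-isomorphic to `graph`?"""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)
    # Try every injective assignment of query vertices to graph vertices.
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(q_labels[n] == g_labels[m[n]] for n in q_nodes)
        edges_ok = all(frozenset((m[u], m[v])) in g_edge_set for u, v in q_edges)
        if labels_ok and edges_ok:
            return True
    return False

def answer(database, index, query):
    # (1) Search: intersect D_f over every feature f of the query.
    candidates = set(database)
    for f in features(query):
        candidates &= index.get(f, set())
    # (2) Fetching and (3) Verification: check each candidate graph.
    answers = {gid for gid in candidates if contains(database[gid], query)}
    return candidates, answers

database = {
    "G1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),            # path A-B-C
    "G2": ({1: "A", 2: "B", 3: "B", 4: "C"}, [(1, 2), (3, 4)]),    # A-B and B-C, disconnected
    "G3": ({1: "A", 2: "B"}, [(1, 2)]),                            # edge A-B
}
index = {}
for gid, g in database.items():
    for f in features(g):
        index.setdefault(f, set()).add(gid)

query = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])  # path A-B-C
candidates, answers = answer(database, index, query)
```

In this toy database G2 survives the search step because it has every feature of the query, but verification rejects it: the candidate set is a superset of the true answer set, which is why the verification step is needed.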
Step 2: Query processing The query response time of the above search framework is formulated as $T_{\text{search}} + |C_Q| \times (T_{\text{io}} + T_{\text{iso\_test}})$, where $T_{\text{search}}$ is the time spent in the search step, $T_{\text{io}}$ is the average I/O time of fetching a candidate graph from the disk, and $T_{\text{iso\_test}}$ is the average time of checking a subgraph isomorphism, which is conducted over the query $Q$ and the graphs in the candidate answer set. The candidate graphs are usually scattered around the entire disk. Thus, $T_{\text{io}}$ is the I/O time of fetching a block on a disk (assuming a graph can be accommodated in one disk block). The value of $T_{\text{iso\_test}}$ does not change much for a given query. Therefore, the key to improving the query response time is to minimize the size of the candidate answer set, $|C_Q|$, as much as possible. When a database is so large that the index cannot be held in main memory, $T_{\text{search}}$ will also affect the query response time.
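A small worked example of the response-time model, search time plus candidate count times (I/O time plus isomorphism-test time). The function name `response_time_ms` and all timings are made-up illustrative values, not measurements.

```python
def response_time_ms(t_search, n_candidates, t_io, t_iso_test):
    """Search time plus per-candidate fetch and verification time (all in ms)."""
    return t_search + n_candidates * (t_io + t_iso_test)

# Hypothetical timings: 50 ms search, 10 ms per disk fetch, 2 ms per test.
loose_index = response_time_ms(50, 1000, 10, 2)  # 1000 candidate graphs
tight_index = response_time_ms(50, 100, 10, 2)   # 100 candidate graphs
print(loose_index, tight_index)
```

With these numbers the per-candidate term dominates, so shrinking the candidate set by a factor of ten cuts the total time by almost the same factor.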
Caution !!! Since all the features in the index contained by a query are enumerated, it is important to maintain a compact feature set in the memory. Otherwise, the cost of accessing the index may be even greater than that of accessing the database itself.
Paths as features to index graphs One solution to substructure search is to take paths as features to index graphs: enumerate all the existing paths in a database up to a maximum length $\mathit{maxL}$ and use them as features to index, where a path is a vertex sequence $v_1, v_2, \ldots, v_k$ such that $\forall\, 1 \le i \le k-1$, $(v_i, v_{i+1})$ is an edge. The index is then used to identify the graphs that contain all the paths (up to length $\mathit{maxL}$) in the query graph.
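Path enumeration can be sketched with a depth-first search. This is a minimal sketch, not GraphGrep's actual procedure: it assumes graphs are (vertex_labels, edges) pairs, measures length in edges, enumerates only simple paths (no repeated vertex), and stores each path as its label sequence in a canonical direction. The name `label_paths` is hypothetical.

```python
def label_paths(graph, max_len):
    """Return the label sequences of all simple paths with at most max_len edges."""
    labels, edges = graph
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    found = set()

    def extend(path):
        seq = tuple(labels[v] for v in path)
        # An undirected path reads the same in both directions: keep one form.
        found.add(min(seq, seq[::-1]))
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # simple paths only (assumption)
                    extend(path + [nxt])

    for v in labels:
        extend([v])
    return found

g = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])  # path A-B-C
paths_1 = label_paths(g, 1)
paths_2 = label_paths(g, 2)
```

Raising `max_len` from 1 to 2 adds the feature ("A", "B", "C"), which shows how the feature set grows with the length bound.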
Example of a path: $v_2, e_7, v_5, e_6, v_4, e_3, v_3$
GraphGrep (Shasha et al., 2002)
Advantages of path indexing Paths are easier to manipulate than trees and graphs. The index space is predefined: all the paths up to the maximum length are selected.
Disadvantages of path indexing In order to answer tree- or graph-structured queries, a path-based approach has to break query graphs into paths, search each path separately for the graphs containing it, and join the results. Since structural information can be lost when query graphs are decomposed into paths, many false positive candidates are likely to be returned. In addition, a graph database may contain millions of different paths if it is large and diverse. These disadvantages motivate the search for new indexing features.
Frequent Structures A straightforward way of extending paths is to use more complicated features, e.g., all the substructures extracted from a graph database. Unfortunately, the number of substructures can be even larger than the number of paths, leading to an exponentially large index in practice. One solution is to set a threshold on substructure frequency and index only the frequent ones.
Frequent Structures Definition 5.2 (Frequent Structures). Given a graph database $D = \{G_1, G_2, \ldots, G_n\}$ and a graph structure $f$, the support of $f$ is defined as $\mathit{sup}(f) = |D_f|$, where $D_f$ is referred to as $f$'s supporting graphs. With a predefined threshold $\mathit{min\_sup}$, $f$ is said to be frequent if $\mathit{sup}(f) \ge \mathit{min\_sup}$. Should a uniform $\mathit{min\_sup}$ be enforced for all the frequent structures? In order to reduce the overall index size, it is appropriate to have a low minimum support on small structures (for effectiveness) and a high minimum support on large structures (for compactness).
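The frequency test with a size-dependent threshold can be sketched as below. This is an illustration only: features are assumed to be label paths whose size is their number of edges, and the linear threshold in `min_sup` is a made-up example of a size-increasing function, not one taken from the source.

```python
def support(index, f):
    """sup(f) = |D_f|, the number of supporting graphs of f."""
    return len(index.get(f, set()))

def min_sup(size):
    """Hypothetical size-increasing threshold: larger structures need more support."""
    return size

def frequent_features(index):
    """Keep the features whose support meets the threshold for their size."""
    return {f for f in index if support(index, f) >= min_sup(len(f) - 1)}

# Assumed toy index: feature (a label path) -> ids of its supporting graphs.
index = {
    ("A", "B"): {"G1", "G2", "G3"},
    ("A", "B", "C"): {"G1", "G2"},
    ("A", "B", "C", "D"): {"G1"},
}
frequent = frequent_features(index)
```

The three-edge path is dropped because its support of 1 is below the threshold of 3 for its size, while the smaller paths pass their lower thresholds.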
Discriminative Structures Example: the structures ABA, ABAB, ABAC, ABCA, and ABC each have support 3. Which structures should we index? Among similar structures with the same support, it is often sufficient to index only the smallest common substructures, since more query graphs may contain these structures (higher coverage).
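One way to make this selection concrete is to visit structures from small to large and keep a structure only if it filters out graphs that its already-kept substructures do not. This is a simplified sketch under strong assumptions: structures are treated as label strings, "substructure" means a contiguous substring in either direction, and the supporting sets are invented for the example. The names `is_substructure` and `select_discriminative` are hypothetical.

```python
def is_substructure(small, large):
    """Assumed containment test for label strings read in either direction."""
    return small in large or small[::-1] in large

def select_discriminative(supporting, all_graphs):
    """Keep a structure only if it prunes more than its kept substructures."""
    kept = []
    for s in sorted(supporting, key=lambda x: (len(x), x)):
        # Graphs still possible given the kept substructures of s.
        implied = set(all_graphs)
        for k in kept:
            if is_substructure(k, s):
                implied &= supporting[k]
        if supporting[s] < implied:  # strict subset: s adds pruning power
            kept.append(s)
    return kept

all_graphs = {"G1", "G2", "G3", "G4", "G5"}
same_three = {"G1", "G2", "G3"}
supporting = {s: set(same_three) for s in ["ABA", "ABAB", "ABAC", "ABCA", "ABC"]}
selected = select_discriminative(supporting, all_graphs)
```

With every structure supported by the same three graphs, only the two smallest structures, ABA and ABC, are kept; each larger structure contains one of them and adds no further pruning.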
Trees A tree, also called a free tree, is a special kind of graph: connected, acyclic, and undirected.
C-Tree
Subgraph queries vs Supergraph queries [Figure: a graph database, a subgraph query, a supergraph query, and the query results]
Subgraph queries vs Supergraph queries A subgraph query searches for a specific pattern in the graph database. Given a graph database $D = \{g_1, g_2, g_3, \ldots, g_n\}$ and a subgraph query $q$, the query answer set is $A = \{g_i \mid q \subseteq g_i,\ g_i \in D\}$. A supergraph query searches for the database graphs whose whole structures are contained in the input query. Given a graph database $D = \{g_1, g_2, g_3, \ldots, g_n\}$ and a supergraph query $q$, the query answer set is $A = \{g_i \mid g_i \subseteq q,\ g_i \in D\}$.
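The two answer sets differ only in the direction of the containment test, which the sketch below makes explicit. It assumes graphs are (vertex_labels, edges) pairs and uses a brute-force subgraph isomorphism check suitable only for tiny graphs. The toy database is invented for illustration and is not the one in the slide figure; `contains`, `subgraph_query`, and `supergraph_query` are hypothetical names.

```python
from itertools import permutations

def contains(big, small):
    """Brute-force test: is `small` subgraph-isomorphic to `big`?"""
    big_labels, big_edges = big
    small_labels, small_edges = small
    big_edge_set = {frozenset(e) for e in big_edges}
    nodes = list(small_labels)
    for image in permutations(big_labels, len(nodes)):
        m = dict(zip(nodes, image))
        labels_ok = all(small_labels[n] == big_labels[m[n]] for n in nodes)
        edges_ok = all(frozenset((m[u], m[v])) in big_edge_set for u, v in small_edges)
        if labels_ok and edges_ok:
            return True
    return False

def subgraph_query(database, q):
    """A = {g | q is contained in g}."""
    return {gid for gid, g in database.items() if contains(g, q)}

def supergraph_query(database, q):
    """A = {g | g is contained in q}."""
    return {gid for gid, g in database.items() if contains(q, g)}

database = {
    "g1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),  # path A-B-C
    "g2": ({1: "B", 2: "C"}, [(1, 2)]),                  # edge B-C
    "g3": ({1: "B", 2: "C", 3: "D"}, [(1, 2), (2, 3)]),  # path B-C-D
}
q = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])         # path A-B-C

sub_answers = subgraph_query(database, q)
super_answers = supergraph_query(database, q)
```

For the same query, the subgraph answer set holds only g1, while the supergraph answer set holds g1 and g2, since both fit entirely inside the query.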
[Figure: database graphs g1, g2, g3 and query graphs q1, q2, drawn over the vertex labels A, B, C, D, E, F, X] What is the answer set?
Supergraph Query Processing The supergraph query processing problem is important but has not been studied extensively in the research literature. Only two approaches have been presented to deal with it: cIndex by Chen et al. (2007) and GPTree by Zhang et al. (2009).
Conclusions (1) Graph indexing is one of the emerging important tasks in graph database management and graph data mining. It is fundamental to many graph-related applications, especially when an application involves large-scale graph databases. We have learned about substructure search, approximate substructure search, and feature-based graph indexing methods that mine and index a compact set of discriminative and selective structure features for fast graph retrieval. These methods are going to significantly improve the performance of advanced graph applications such as graph classification and clustering. There is a clear imbalance between the number of techniques developed for processing supergraph queries and the number developed for other types of graph queries. The reason is that the supergraph query type is relatively new. Therefore, many technical aspects still remain unexplored.
Thank you…