Graph Indexing Techniques
Seoul National University, IDB Lab.
Kisung Kim
Outline
  Category of graph queries
  Querying in collection DB
  References
Category of Graph Queries: Matching Type
  Exact subgraph matching
    – Find the graphs in the DB that contain all components of the query graph
  Similarity subgraph matching
    – Find the graphs in the DB that contain some components of the query graph
    – A similarity measure is needed
  Supergraph matching
    – Find the graphs in the DB that are contained in the query graph
  [Figure: a query graph with an exact subgraph match and a similarity subgraph match]
Category of Graph Queries: Target DB
  Collection DB: a large number of small graphs
    – e.g. chemical compounds
    – Retrieval component: IDs of the graphs that contain matching parts
  Large graphs: a small number of large graphs
    – e.g. social networks, RDF graphs
    – Retrieval component: all matching subgraphs
  [Figure: querying a collection DB of graphs G1 to G7 returns a graph ID list (G1, G3, G5); querying large graphs returns the matching subgraphs]
Query Processing in Collection DB
  Processing flow: Query → Filtering → Candidate graph set → Verification → Answer graphs
  Verification uses an ordinary pair-wise subgraph isomorphism algorithm
  Most techniques focus on filtering
    – The cost of verification is high
    – Filtering reduces the number of verification runs
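The filtering-then-verification flow can be sketched in a few lines of Python. This is a minimal illustration, not the method of any indexed technique: the graph representation, the helper names (`make_graph`, `passes_filter`, `verify`, `answer_query`) and the label-count filter are all assumptions made for the example, and the verifier is a brute-force test that is only usable on tiny graphs.

```python
from collections import Counter
from itertools import permutations

def make_graph(labels, edges):
    """labels: node -> label; edges: iterable of undirected (u, v) pairs."""
    return {"labels": labels, "edges": {frozenset(e) for e in edges}}

def passes_filter(query, graph):
    """Cheap necessary condition (assumed filter): enough nodes of each label."""
    need = Counter(query["labels"].values())
    have = Counter(graph["labels"].values())
    return all(have[label] >= n for label, n in need.items())

def verify(query, graph):
    """Brute-force subgraph isomorphism test; exponential in the query size."""
    q_nodes = list(query["labels"])
    for image in permutations(graph["labels"], len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(query["labels"][v] == graph["labels"][m[v]] for v in q_nodes)
        edges_ok = all(frozenset(m[v] for v in e) in graph["edges"]
                       for e in query["edges"])
        if labels_ok and edges_ok:
            return True
    return False

def answer_query(query, db):
    # Stage 1: filtering keeps every graph that might contain the query.
    candidates = [gid for gid, g in db.items() if passes_filter(query, g)]
    # Stage 2: the expensive test runs on the candidates only.
    answers = [gid for gid in candidates if verify(query, db[gid])]
    return candidates, answers

# Hypothetical toy database and query (an A-C edge).
db = {
    "g1": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)]),
    "g2": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    "g3": make_graph({1: "A", 2: "A"}, [(1, 2)]),
}
query = make_graph({"x": "A", "y": "C"}, [("x", "y")])
candidates, answers = answer_query(query, db)
print(candidates, answers)
```

The filter may let false positives through (here g2), but it must never discard a true answer; verification then removes the false positives.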
Query Processing in Large Graphs
  Processing flow: Query → Index search → Candidate node sets → Building subgraphs → Answer subgraphs
  Focus on node indexing
    – To reduce the search space
    – Uses structural information of nodes
  Build subgraphs by joining candidate nodes
    – Join methods have received relatively little research attention
    – Optimization using join ordering
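A rough sketch of this flow, under stated assumptions: the "structural information" used here is just the node label plus a minimum degree, and the join is a plain backtracking search that starts with the query nodes having the fewest candidates. The function names and the toy graph are hypothetical and do not reproduce GraphQL or Spath.

```python
def make_graph(labels, edges):
    """labels: node -> label; edges: undirected (u, v) pairs."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {"labels": labels, "adj": adj}

def candidate_nodes(query, graph):
    """Per query node: data nodes with the same label and at least its degree."""
    cands = {}
    for q, q_label in query["labels"].items():
        q_deg = len(query["adj"][q])
        cands[q] = [v for v, label in graph["labels"].items()
                    if label == q_label and len(graph["adj"][v]) >= q_deg]
    return cands

def join_candidates(query, graph, cands):
    """Backtracking join; smallest candidate sets first (a simple join order)."""
    order = sorted(cands, key=lambda q: len(cands[q]))
    results = []

    def extend(i, mapping):
        if i == len(order):
            results.append(dict(mapping))
            return
        q = order[i]
        for v in cands[q]:
            if v in mapping.values():
                continue  # each data node is used at most once
            # Every already-mapped query neighbor must be a data neighbor of v.
            if all(mapping[n] in graph["adj"][v]
                   for n in query["adj"][q] if n in mapping):
                mapping[q] = v
                extend(i + 1, mapping)
                del mapping[q]

    extend(0, {})
    return results

# Hypothetical data graph and query (path A - B - C).
data = make_graph({1: "A", 2: "B", 3: "C", 4: "B", 5: "A"},
                  [(1, 2), (2, 3), (1, 3), (4, 5)])
query = make_graph({"x": "A", "y": "B", "z": "C"}, [("x", "y"), ("y", "z")])
cands = candidate_nodes(query, data)
matches = join_candidates(query, data, cands)
print(cands, matches)
```

Node 4 is dropped from the candidates of `y` by the degree check alone, before any join work is done, which is the point of node indexing.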
Graph Indexing Techniques
  Technique                           Target Database  Query Type          Approach
  GraphGrep [Shasha et al., PODS’02]  Collection DB    Exact               Feature (path) based index
  gIndex [Yan et al., SIGMOD’04]      Collection DB    Exact               Feature (graph) based index
  Grafil [Yan et al., SIGMOD’05]      Collection DB    Exact & Similarity  Feature based similarity search
  C-tree [He and Singh, ICDE’06]      Collection DB    Exact & Similarity  Closure based index
  QuickSI [Shang et al., VLDB’08]     Collection DB    Exact               Verification algorithm
  Tale [Tian and Patel, ICDE’08]      Collection DB    Exact & Similarity  Similarity search using node index
  GraphQL [He and Singh, SIGMOD’08]   Large graphs     Exact               Node indexing
  Spath [Zhao and Han, VLDB’10]       Large graphs     Exact               Node indexing using neighborhood information
GraphGrep (1/2) [Shasha et al., PODS’02]
  First work to adopt the filtering-and-verification framework
  Path-based index
    – Serves as a fingerprint of the database
    – Enumerate all paths (length <= L) of all graphs in the DB
    – For each path, store its number of occurrences in each graph in a hash table
  [Figure: graphs g1, g2, g3 and the index built from them]
    Key       g1  g2  g3
    h(CA)     1   0   1
    …
    h(ABCB)   2   2   0
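A minimal sketch of building such a path index. Several choices here are assumptions rather than GraphGrep's exact definitions: path length is counted in edges, only simple paths are enumerated, an undirected path and its reverse count as one occurrence, labels are single characters, and a Python dict stands in for the hash table. The names `path_counts` and `build_index` are hypothetical.

```python
from collections import Counter, defaultdict

def path_counts(labels, adj, max_len):
    """Count label paths with 1..max_len edges in one graph."""
    counts = Counter()

    def walk(path):
        # Count each undirected path once: only when start id < end id.
        if len(path) > 1 and path[0] < path[-1]:
            seq = "".join(labels[v] for v in path)
            counts[min(seq, seq[::-1])] += 1  # canonical orientation
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple paths only
                walk(path + [nxt])

    for start in labels:
        walk([start])
    return counts

def build_index(db, max_len):
    """index[path][graph_id] = number of occurrences of path in that graph."""
    index = defaultdict(dict)
    for gid, (labels, adj) in db.items():
        for path, n in path_counts(labels, adj, max_len).items():
            index[path][gid] = n
    return index

# Hypothetical database: g1 is the path B - A - C, g2 is the single edge A - B.
db = {
    "g1": ({0: "B", 1: "A", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}),
    "g2": ({0: "A", 1: "B"}, {0: [1], 1: [0]}),
}
index = build_index(db, max_len=2)
print(dict(index))
```

A graph that does not contain a path simply has no entry under that key, which a lookup can treat as a count of 0.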
GraphGrep (2/2): Query Processing
  Filtering
    – Build the fingerprint of the query q by hashing all paths (length <= L) of q
    – Compare the query fingerprint with the database fingerprint
    – Discard every graph whose count for some path is less than the count in the query fingerprint
  Verification
    – Run subgraph isomorphism tests on the remaining candidates
  [Figure: a query with path counts AB:1, AC:1, BAC:1 is compared against the index; candidates = {g1, g3}, which go on to verification]
    Key      g1  g2  g3
    h(AB)    2   2   1
    h(AC)    1   0   1
    h(BAC)   2   0   1
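The filtering rule can be written directly against the counts on this slide. The sketch below uses a plain dict as the index and the hypothetical name `filter_candidates`; the per-graph counts and the query fingerprint are the ones shown for g1, g2 and g3.

```python
def filter_candidates(index, query_fp, graph_ids):
    """Keep a graph only if, for every query path, it has at least as many
    occurrences as the query does."""
    return [
        g for g in graph_ids
        if all(index.get(path, {}).get(g, 0) >= need
               for path, need in query_fp.items())
    ]

# Counts per path and graph, as on the slide.
index = {
    "AB":  {"g1": 2, "g2": 2, "g3": 1},
    "AC":  {"g1": 1, "g2": 0, "g3": 1},
    "BAC": {"g1": 2, "g2": 0, "g3": 1},
}
query_fp = {"AB": 1, "AC": 1, "BAC": 1}

candidates = filter_candidates(index, query_fp, ["g1", "g2", "g3"])
print(candidates)
```

g2 is discarded because it has no AC path at all; g1 and g3 still need the subgraph isomorphism test, since matching counts are necessary but not sufficient.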
gIndex (1/6) [Yan et al., SIGMOD’04]
  The path-based approach has weak points
    – Paths are too simple: structural information is lost
    – There are too many paths: the set of paths in a graph database is usually huge
  Solution
    – Use graph structures instead of paths as the basic index feature
  [Figure: a sample database and a query whose nodes are all labeled c; the paths in the query graph cannot filter any graphs in the database]
gIndex (2/6): Frequent Fragment
  The number of graph structures is large
  Index only frequent subgraphs
  support(g)
    – The number of graphs in D (the graph database) that contain g as a subgraph
  minSup
    – Minimum support threshold
    – Index a fragment g only if support(g) ≥ minSup
  Size-increasing support
    – The number of fragments grows as the fragment size increases, so minSup grows with size as well
    – Low minSup for small fragments, high minSup for large fragments
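A small sketch of selecting frequent fragments under a size-increasing threshold. It assumes the set of graphs containing each fragment is already known (computing it needs subgraph isomorphism tests), that size means the number of edges, and that the threshold is a simple step function; the names `min_sup` and `frequent_fragments`, the step shape, and the fragment data are all hypothetical.

```python
def support(containing_ids):
    """support(g): number of database graphs that contain fragment g."""
    return len(containing_ids)

def min_sup(size, small=1, large=2, switch_at=3):
    """Hypothetical size-increasing threshold: low for small fragments,
    higher once the fragment reaches `switch_at` edges."""
    return small if size < switch_at else large

def frequent_fragments(fragments):
    """fragments: name -> (size, set of IDs of graphs containing it)."""
    return sorted(
        name for name, (size, ids) in fragments.items()
        if support(ids) >= min_sup(size)
    )

# Hypothetical fragments with precomputed containing-graph sets.
fragments = {
    "f1": (1, {"g1", "g2", "g3"}),
    "f2": (2, {"g1"}),
    "f3": (3, {"g2"}),
    "f4": (3, {"g1", "g3"}),
}
indexed = frequent_fragments(fragments)
print(indexed)
```

With a single uniform threshold of 2, the small fragment f2 would be lost; with a uniform threshold of 1, the large fragment f3 would be indexed as well. The size-dependent threshold keeps the first and drops the second.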
gIndex (3/6): Frequent Fragment
  [Figure: fragments of size 1 to 4 over node labels A and B, each annotated with its frequency F (between 1 and 4), with the thresholds minSup=1 and minSup=2 marked]
gIndex (4/6): Discriminative Fragment
  [Figure: graphs g1 to g4 and fragments f1, f2, f3 (sizes 2 and 3)]
  D_f1 = {g1, g2, g3}
  D_f2 = {g2, g3, g4}
  D_f3 = {g2, g3} = D_f1 ∩ D_f2
  f3 is redundant: its ID list filters out nothing that f1 and f2 together do not already filter out
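The redundancy check can be sketched as a comparison between a fragment's ID list and the intersection of the ID lists of its already-indexed subfragments. The ratio form and the threshold `gamma_min` below are assumptions of this sketch (the slide only shows the equality case), and the extra fragment `D_f4` is hypothetical.

```python
def discriminative_ratio(frag_ids, subfragment_id_lists):
    """|intersection of subfragment ID lists| / |ID list of the fragment|.
    A ratio of 1.0 means the fragment adds no filtering power."""
    inter = set.intersection(*subfragment_id_lists)
    return len(inter) / len(frag_ids)

def is_discriminative(frag_ids, subfragment_id_lists, gamma_min=1.5):
    # gamma_min is an assumed threshold, chosen only for this example.
    return discriminative_ratio(frag_ids, subfragment_id_lists) >= gamma_min

# ID lists from the slide.
D_f1 = {"g1", "g2", "g3"}
D_f2 = {"g2", "g3", "g4"}
D_f3 = {"g2", "g3"}

# Hypothetical larger fragment contained in g2 only.
D_f4 = {"g2"}

print(discriminative_ratio(D_f3, [D_f1, D_f2]), is_discriminative(D_f3, [D_f1, D_f2]))
print(discriminative_ratio(D_f4, [D_f1, D_f2]), is_discriminative(D_f4, [D_f1, D_f2]))
```

Under these assumptions f3 is not indexed with an ID list, while the hypothetical f4 is, because it halves the candidate set left by f1 and f2.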
gIndex (5/6): gIndex Tree
  Use a graph serialization method
    – For fast graph isomorphism checking during index search
    – DFS coding [Yan et al. ICDM’02]
    – Translates a graph into a unique edge sequence
  gIndex tree
    – Prefix tree built from the edge sequences of discriminative fragments
    – Records all size-n discriminative fragments at level n
    – Black nodes: discriminative fragments; each has an ID list with the IDs of the graphs containing f_i
    – White nodes: redundant fragments, kept for Apriori pruning
  [Figure: DFS coding of a graph with nodes v0 to v3 (node labels X, Y, Z; edge labels a, b); gIndex tree with levels 0, 1, 2, … in which edges e1, e2, e3 lead to fragments f1, f2, f3]
gIndex (6/6): Searching
  Searching process
    – Given a query q, enumerate all of q’s fragments (size <= maxSize)
    – Locate the fragments in the gIndex tree
    – Intersect the ID lists associated with the fragments
  Apriori pruning
    – Generating every fragment is inefficient
    – If a fragment is not in the gIndex tree, its supergraphs need not be checked
    – Redundant fragments must be recorded in the tree for Apriori pruning to work
  [Figure: query fragments matched level by level against the gIndex tree; enumeration stops at a fragment that is not in the tree]
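A simplified sketch of the search with Apriori pruning. Fragments are stood in for by tuples of edge names such as `("e1", "e2")`, the index is a flat dict in which `None` marks a redundant (white) node, and `query_children` lists which fragments of the query extend which; these structures, the function name `search`, and the data are assumptions for illustration and do not implement DFS coding.

```python
def search(index, query_children, all_ids, max_size):
    """Intersect the ID lists of the query fragments found in the index.
    A fragment missing from the index is skipped with all its extensions."""
    candidates = set(all_ids)
    visited = []
    frontier = [()]  # start from the empty prefix (tree root)
    while frontier:
        frag = frontier.pop()
        for child in query_children.get(frag, []):
            if len(child) > max_size or child not in index:
                continue  # Apriori pruning: never expand this branch
            visited.append(child)
            if index[child] is not None:  # black node: has an ID list
                candidates &= index[child]
            frontier.append(child)  # white nodes are still expanded
    return candidates, visited

# Hypothetical index: None marks a redundant fragment kept only for pruning.
index = {
    ("e1",): {"g1", "g2", "g3"},
    ("e2",): {"g2", "g3", "g4"},
    ("e1", "e2"): None,
    ("e1", "e2", "e3"): {"g2", "g3"},
}
# Hypothetical fragment lattice of the query.
query_children = {
    (): [("e1",), ("e2",), ("e9",)],
    ("e1",): [("e1", "e2")],
    ("e1", "e2"): [("e1", "e2", "e3")],
    ("e9",): [("e9", "e1")],
}
candidates, visited = search(index, query_children,
                             ["g1", "g2", "g3", "g4"], max_size=3)
print(candidates, visited)
```

If the redundant fragment `("e1", "e2")` were not stored, the search would stop there and never reach `("e1", "e2", "e3")`, which is why white nodes are recorded.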
Grafil (1/4) [Yan et al., SIGMOD’05]
  [Figure: each database graph G1, G2, …, Gn is represented by a feature vector {u1, u2, …, un} and the query by a feature vector {v1, v2, …, vn}; the vectors support both subgraph exact search and subgraph similarity search]
Grafil (2/4): Feature Misses
  [Figure: a query, its relaxed queries with 1 missed edge, and the features f_a, f_b, f_c]
  Feature occurrences (f_a, f_b, f_c)
    – Query: (1, 2, 4), 7 in total
    – Relaxed queries: (1, 0, 3), 4 in total; (0, 1, 2), 3 in total; (0, 1, 2)
  Feature misses
    – 7-4=3 and 7-3=4
  Maximum feature misses: m_max = 4
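The arithmetic on this slide can be reproduced with a short sketch. The feature vectors are the ones shown for the query and two of its relaxed queries; the function name `feature_misses` is hypothetical.

```python
def feature_misses(query_vec, relaxed_vec):
    """Feature occurrences lost when the query is relaxed."""
    return sum(query_vec) - sum(relaxed_vec)

# Occurrence counts of (f_a, f_b, f_c), taken from the slide.
query_vec = (1, 2, 4)
relaxed_vecs = [(1, 0, 3), (0, 1, 2)]

misses = [feature_misses(query_vec, r) for r in relaxed_vecs]
m_max = max(misses)
print(misses, m_max)
```

The maximum over all relaxed queries is the bound used for filtering: a database graph that misses more than `m_max` feature occurrences cannot match any relaxed query.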
Grafil (3/4): Feature Miss Estimation
  Problem
    – Given a query Q, the set of features contained in Q, and a relaxation ratio, what is the maximal number of features that can be missed?
  Use the edge-feature matrix
    – Find the maximum number of columns that can be hit by k rows
    – k: the number of missing edges in Q
  Classic maximum coverage problem (set k-cover)
    – Proved NP-complete
  [Figure: a query with edges e1, e2, e3 and features f_a, f_b, f_c; edge-feature matrix with rows e1, e2, e3 and columns f_a, f_b1, f_b2, f_c1, f_c2, f_c3, f_c4, one column per feature occurrence]
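For a tiny query the maximum coverage problem can be solved by exhaustive search, which makes the definition concrete. The sketch below is not an estimation algorithm suited to real workloads: it tries every choice of k rows, which is exponential in general. The matrix entries are hypothetical (only the row and column names come from the slide), and `max_feature_misses` is an assumed name.

```python
from itertools import combinations

def max_feature_misses(edge_hits, k):
    """Largest number of columns (feature occurrences) hit by any k rows
    (edges) of the edge-feature matrix."""
    best = 0
    for rows in combinations(edge_hits, k):
        covered = set().union(*(edge_hits[e] for e in rows))
        best = max(best, len(covered))
    return best

# Hypothetical edge-feature matrix, stored row by row:
# edge -> set of feature occurrences that use that edge.
edge_hits = {
    "e1": {"fa", "fb1", "fc1", "fc2"},
    "e2": {"fb1", "fb2", "fc2", "fc3"},
    "e3": {"fa", "fb2", "fc3", "fc4"},
}

one_edge = max_feature_misses(edge_hits, 1)
two_edges = max_feature_misses(edge_hits, 2)
print(one_edge, two_edges)
```

With these made-up entries, deleting one edge destroys at most 4 feature occurrences and deleting two destroys all 7.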
Grafil (4/4): Feature Conjugation
  The misses of one feature are compensated by the occurrences of other features in G
  Using all the features together in one filter would deteriorate the filtering performance
  Solution
    – Use multiple filters
    – Feature set selection
  [Figure: a query with features f_a (3 occurrences) and f_b (4 occurrences), relaxation ratio = 1, m_max = 4; for the graph shown, the feature misses are (3-0)+0=3 ≤ m_max, so the single filter cannot discard it]
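A sketch of why several filters can prune more than one combined filter. The miss count per feature is taken as max(0, query count − graph count), matching the (3-0)+0=3 computation on the slide; the graph's f_b count of 4 is assumed from the 0 misses shown. The per-filter bounds in the second call are hypothetical values chosen only to show the effect, and the names `misses` and `passes` are assumptions.

```python
def misses(query_counts, graph_counts, features):
    """Feature occurrences of the query that the graph fails to provide."""
    return sum(max(0, query_counts[f] - graph_counts.get(f, 0))
               for f in features)

def passes(query_counts, graph_counts, filters):
    """filters: list of (feature set, maximum allowed misses for that set).
    A graph survives only if it passes every filter."""
    return all(misses(query_counts, graph_counts, fs) <= bound
               for fs, bound in filters)

query_counts = {"fa": 3, "fb": 4}   # from the slide
graph_counts = {"fa": 0, "fb": 4}   # f_b count assumed from the 0 misses shown

# One filter over all features, with m_max = 4 as on the slide.
single = passes(query_counts, graph_counts, [({"fa", "fb"}, 4)])

# Two filters with hypothetical per-set bounds.
multiple = passes(query_counts, graph_counts, [({"fa"}, 2), ({"fb"}, 3)])
print(single, multiple)
```

The combined filter keeps the graph because f_b contributes no misses and leaves slack in the shared bound; a filter on f_a alone, with its own tighter bound, would discard it.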
References
  [Shasha et al., PODS’02] Dennis Shasha, Jason T. L. Wang, Rosalba Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002
  [Yan et al., SIGMOD’04] Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD, 2004
  [Yan et al., SIGMOD’05] Xifeng Yan, Philip S. Yu, Jiawei Han. Substructure Similarity Search in Graph Databases. SIGMOD, 2005
  [Tian and Patel, ICDE’08] Yuanyuan Tian, Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008
  [He and Singh, SIGMOD’08] Huahai He, Ambuj K. Singh. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. SIGMOD, 2008
  [Zhao and Han, VLDB’10] Peixiang Zhao, Jiawei Han. On Graph Query Optimization in Large Networks. VLDB, 2010
  [He and Singh, ICDE’06] Huahai He, Ambuj K. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006
  [Shang et al., VLDB’08] Haichuan Shang, Ying Zhang, Xuemin Lin, Jeffrey Xu Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. VLDB, 2008