Connected Substructure Similarity Search
Haichuan Shang (The University of New South Wales & NICTA, Australia)
Joint work with: Xuemin Lin (The University of New South Wales & NICTA, Australia); Ying Zhang (The University of New South Wales, Australia); Jeffrey Xu Yu (Chinese University of Hong Kong, China); Wei Wang (The University of New South Wales & NICTA, Australia)
Outline 1. Motivation 2. Similarity Measure 3. Techniques 4. Experimental Study 5. Conclusion
Application 1. Chemistry 2. Bioinformatics 3. Software Engineering 4. Social Network Chemical Compounds
Substructure Search
Substructure Similarity Search. Why similarity search? Input mistakes, exploration. Existing work: Grafil (SIGMOD'05), Closure-tree (ICDE'06), GDIndex (ICDE'07), Comparing Stars (VLDB'09).
Graph Similarity: Subgraph Similarity. Similarity measures: Maximum Common Subgraph (MCS, counting the number of missing edges); Edit Distance; variants. None of these enforces connectivity.
Graph Similarity A New Similarity Measure. Maximum Connected Common Subgraph – MCCS (counting missing edges while retaining the connectivity)
Graph Similarity
Maximum Connected Common Subgraph (MCCS): Given two graphs $g_1$ and $g_2$, the maximum connected common subgraph of $g_1$ and $g_2$ is the largest connected subgraph of $g_1$ that is subgraph isomorphic to $g_2$, denoted $\mathrm{mccs}(g_1, g_2)$.
Subgraph Distance: Given a query graph $q$ and a data graph $g$, the subgraph distance is defined as $\mathrm{dist}(q, g) = |q| - |\mathrm{mccs}(q, g)|$, where the size of a graph is its number of edges (so the distance is the number of edges missing from the query).
Substructure Similarity Search: Given a graph database $D = \{g_1, g_2, \ldots, g_n\}$, a query graph $q$, and a subgraph distance threshold $\sigma$, substructure similarity search retrieves all graphs $g_i \in D$ with $\mathrm{dist}(q, g_i) \le \sigma$.
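The definitions on this slide can be illustrated with a brute-force sketch for very small vertex-labelled, undirected graphs. This is not the detection algorithm presented later in the talk; the graph encoding (a label dictionary plus an edge list) and the names `subgraph_distance` and `similarity_search` are assumptions made only for illustration, and `sigma` stands for the distance threshold.

```python
from itertools import combinations, permutations

def is_connected(edges):
    """True if the given edges form a single connected component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def embeds(sub_edges, q_labels, g_labels, g_edges):
    """Brute force: is the subgraph spanned by sub_edges subgraph isomorphic to g?"""
    verts = sorted({v for e in sub_edges for v in e})
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_labels, len(verts)):
        m = dict(zip(verts, image))
        labels_ok = all(q_labels[v] == g_labels[m[v]] for v in verts)
        edges_ok = all(frozenset((m[u], m[v])) in g_edge_set for u, v in sub_edges)
        if labels_ok and edges_ok:
            return True
    return False

def subgraph_distance(q_labels, q_edges, g_labels, g_edges):
    """dist(q, g) = |q| - |mccs(q, g)|, with graph size counted in edges."""
    for k in range(len(q_edges), 0, -1):          # try the largest subgraphs first
        for sub in combinations(q_edges, k):
            if is_connected(sub) and embeds(sub, q_labels, g_labels, g_edges):
                return len(q_edges) - k
    return len(q_edges)                           # no common edge at all

def similarity_search(database, q_labels, q_edges, sigma):
    """Names of all data graphs g with dist(q, g) <= sigma."""
    return [name for name, (g_labels, g_edges) in database.items()
            if subgraph_distance(q_labels, q_edges, g_labels, g_edges) <= sigma]

# Query: path A-B-C-D-E. Data graph: same vertices, but edge B-C is absent.
labels = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}
q_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
g_edges = [(0, 1), (2, 3), (3, 4)]

print(subgraph_distance(labels, q_edges, labels, g_edges))  # 2
```

In this example only one query edge is absent from the data graph, yet the distance is 2: the largest connected common part is C-D-E, because the matching edge A-B is cut off from it. This is the difference between MCCS and a measure that does not enforce connectivity.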
Framework
Feature-based exact subgraph search: overview (diagram: Query, Feature (Index), Data; stages: Pruning, Validation)
Similarity Search (triangular inequality). Diagram: Query ($Q$), Feature (Index) ($F$), Data ($D$), with distances $\mathrm{dist}(Q,D)$, $\mathrm{dist}(Q,F)$, $\mathrm{dist}(F,D)$. Question: does $\mathrm{dist}(Q,F) + \mathrm{dist}(F,D) \ge \mathrm{dist}(Q,D)$ hold? In the example shown, it holds.
The triangular inequality does not always hold. Diagram: Query ($Q$), Feature (Index) ($F$), Data ($D$). In the example shown, $\mathrm{dist}(Q,F) + \mathrm{dist}(F,D) \ge \mathrm{dist}(Q,D)$ fails.
Connectivity Dominance: The connectivity of $\mathrm{mccs}(g_1, g_2)$ dominates the connectivity of $g_2$ if there is a subgraph isomorphic mapping from $\mathrm{mccs}(g_1, g_2)$ to $g_2$ such that, after removing all the edges of this mapping from $g_2$, all the vertices in the embedding are disconnected (i.e., the removal fully disconnects $g_2$).
Connectivity Dominance
Theorem. Given three graphs $g_1$, $g_2$, and $g_3$, if the connectivity of $\mathrm{mccs}(g_1, g_2)$ dominates $g_2$ or the connectivity of $\mathrm{mccs}(g_3, g_2)$ dominates $g_2$, then $\mathrm{dist}(g_1, g_3) \le \mathrm{dist}(g_1, g_2) + \mathrm{dist}(g_2, g_3)$.
Here $g_1$ = Query, $g_2$ = Feature (Index), $g_3$ = Data.
Example 1 and Example 2 (figure labels): $\mathrm{mccs}(g_2, g_3)$ dominates $g_2$; $\mathrm{mccs}(g_1, g_2)$ does not dominate $g_2$; $\mathrm{mccs}(g_2, g_3)$ does not dominate $g_2$.
Checking dominance: count the number of disconnected components (linear algorithm).
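The "count the number of disconnected components" step can be sketched as follows. This is one reading of the dominance definition, assumed here for illustration: after deleting the embedded edges from the feature graph, no two vertices of the embedding may remain in the same connected component. The function name `dominates` and its arguments are hypothetical, and the sketch checks a single given embedding, whereas the definition only asks that some embedding with this property exists.

```python
def dominates(g_vertices, g_edges, emb_vertices, emb_edges):
    """Check one embedding: after removing its edges from g, are all of its
    vertices in pairwise different connected components? Runs in O(V + E)."""
    removed = {frozenset(e) for e in emb_edges}
    adj = {v: [] for v in g_vertices}
    for u, v in g_edges:
        if frozenset((u, v)) not in removed:      # keep only the surviving edges
            adj[u].append(v)
            adj[v].append(u)

    # Label every vertex with the representative of its connected component.
    component = {}
    for source in g_vertices:
        if source in component:
            continue
        component[source] = source
        stack = [source]
        while stack:
            for w in adj[stack.pop()]:
                if w not in component:
                    component[w] = source
                    stack.append(w)

    ids = [component[v] for v in emb_vertices]
    return len(set(ids)) == len(ids)              # no two embedded vertices share one

# Feature graph is a path 1-2-3 and the embedding covers it entirely.
print(dominates([1, 2, 3], [(1, 2), (2, 3)], [1, 2, 3], [(1, 2), (2, 3)]))          # True
# Feature graph is a triangle; the leftover edge 1-3 keeps vertices 1 and 3 connected.
print(dominates([1, 2, 3], [(1, 2), (2, 3), (1, 3)], [1, 2, 3], [(1, 2), (2, 3)]))  # False
```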
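Validation Rule 1, from $\mathrm{dist}(Q,F) + \mathrm{dist}(F,D) \ge \mathrm{dist}(Q,D)$: $\mathrm{dist}(Q,F) + \mathrm{dist}(F,D) \le \sigma \Rightarrow \mathrm{dist}(Q,D) \le \sigma$. Condition: $\mathrm{mccs}(Q,F)$ dominates $F$ or $\mathrm{mccs}(F,D)$ dominates $F$.
Pruning Rule 1, from $\mathrm{dist}(Q,D) + \mathrm{dist}(D,F) \ge \mathrm{dist}(Q,F)$: $\mathrm{dist}(Q,F) - \mathrm{dist}(D,F) > \sigma \Rightarrow \mathrm{dist}(Q,D) > \sigma$. Condition: $\mathrm{mccs}(D,F)$ dominates $D$.
Pruning Rule 2, from $\mathrm{dist}(F,Q) + \mathrm{dist}(Q,D) \ge \mathrm{dist}(F,D)$: $\mathrm{dist}(F,D) - \mathrm{dist}(F,Q) > \sigma \Rightarrow \mathrm{dist}(Q,D) > \sigma$. Condition: $\mathrm{mccs}(F,Q)$ dominates $Q$.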
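A minimal sketch of how the three rules might be combined for one feature and one data graph. It assumes the four distances and the three dominance conditions have already been computed; the function name `filter_decision`, its parameters, and the returned strings are hypothetical, and `sigma` stands for the distance threshold. Note that the subgraph distance is not symmetric, so dist(Q,F) and dist(F,Q) are separate inputs.

```python
def filter_decision(d_qf, d_fd, d_df, d_fq, sigma,
                    validation_dom, pruning1_dom, pruning2_dom):
    """Apply the rules for one (query Q, feature F, data graph D) triple.

    d_qf = dist(Q,F), d_fd = dist(F,D), d_df = dist(D,F), d_fq = dist(F,Q).
    validation_dom: mccs(Q,F) dominates F or mccs(F,D) dominates F
    pruning1_dom:   mccs(D,F) dominates D
    pruning2_dom:   mccs(F,Q) dominates Q
    """
    # Validation Rule 1: dist(Q,F) + dist(F,D) <= sigma  =>  dist(Q,D) <= sigma
    if validation_dom and d_qf + d_fd <= sigma:
        return "answer"
    # Pruning Rule 1: dist(Q,F) - dist(D,F) > sigma  =>  dist(Q,D) > sigma
    if pruning1_dom and d_qf - d_df > sigma:
        return "pruned"
    # Pruning Rule 2: dist(F,D) - dist(F,Q) > sigma  =>  dist(Q,D) > sigma
    if pruning2_dom and d_fd - d_fq > sigma:
        return "pruned"
    # No rule applies: the candidate still needs exact verification.
    return "verify"

print(filter_decision(1, 1, 0, 0, 2, True, False, False))   # answer
print(filter_decision(5, 0, 1, 0, 3, False, True, False))   # pruned
print(filter_decision(1, 2, 0, 0, 2, True, True, True))     # verify
```

Each rule is guarded by its dominance condition because the underlying triangular inequality is only guaranteed when that condition holds.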
Verification Algorithm
Basic idea:
1. Enumerate sub-spanning trees of the query graph such that the number of missing edges is at most $\sigma$; try to terminate the algorithm as early as possible.
2. Share the enumeration cost in two ways: a. do not enumerate everything from scratch; b. once enumerated, keep the enumerated spanning trees.
Convert the query to a QI-Sequence [VLDB08] to favour earlier termination (prefix = induced subgraph):
1.1 Infrequent label (in all data graphs) first
1.2 Higher-degree vertex (in the query graph) first
1.3 Dense induced subgraph (in the query graph) first
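The ordering heuristics 1.1 to 1.3 can be sketched as a greedy vertex ordering. This is not the QI-Sequence construction from [VLDB08] itself, only an illustration of the three preferences; the function name `order_query_vertices`, the `label_freq` table of label counts over the data graphs, and the choice to apply the criteria in the listed priority order are all assumptions.

```python
def order_query_vertices(labels, edges, label_freq):
    """Greedy ordering of query vertices in which every prefix stays connected.

    Preference: rarer label first, then higher degree, then more edges
    back into the already chosen prefix (a denser induced prefix).
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    order, prefix = [], set()
    remaining = set(labels)
    while remaining:
        # Only extend along edges touching the prefix, once a prefix exists.
        frontier = {v for v in remaining if adj[v] & prefix}
        if not frontier:                 # first vertex, or a disconnected query
            frontier = remaining

        def rank(v):
            back_edges = len(adj[v] & prefix)
            return (label_freq.get(labels[v], 0), -len(adj[v]), -back_edges, v)

        chosen = min(frontier, key=rank)
        order.append(chosen)
        prefix.add(chosen)
        remaining.discard(chosen)
    return order

# Hypothetical query: label N is rare in the data graphs, C is very common.
q_labels = {"a": "C", "b": "C", "c": "N", "d": "O"}
q_edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
label_freq = {"C": 100, "N": 5, "O": 20}

print(order_query_vertices(q_labels, q_edges, label_freq))  # ['c', 'd', 'b', 'a']
```

Placing rare labels at the front means that a mismatch, if there is one, tends to be discovered after matching only a few vertices, which is the early-termination effect the slide refers to.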
Verification Algorithm: MCCS Detection Algorithm
1. Compute the QI-Sequence.
2. DFS: threshold-based DFS search (A-B-C matched).
3. Generate a new QI-Sequence from the existing one (the slides show removing edge B-D, edge B-E, and edge B-F in turn).
4. DFS: threshold-based DFS search in the right subtree (the second A-B matched).
5. Generate a new QI-Sequence from the existing one (remove edge B-C).
6. Terminate ($\mathrm{dist}(q, g) \le 3$).
Feature Selection
Pruning Rule 1 ($\mathrm{mccs}(D,F)$ dominates $D$) and Pruning Rule 2 ($\mathrm{mccs}(F,Q)$ dominates $Q$) => $F$ should be dense => Discriminative Frequent Induced Subgraphs.
Validation Rule 1 ($\mathrm{mccs}(F,D)$ dominates $F$ or $\mathrm{mccs}(Q,F)$ dominates $F$) => $F$ nearly contains $Q$ and $F$ should be sparse => Frequent Large Sparse Subgraphs.
Algorithm: gSpan [ICDM02] with our on-the-fly feature selection.
Experiments: Settings
CPU: Intel Xeon 2.40GHz
Memory: 4G
System: Debian Linux
Compiler: GNU GCC
Dataset: AIDS Antiviral dataset, a popular benchmark, 43k chemical bonds
Experiments
Conclusion: Connected Substructure Similarity Search
1. Measure: Maximum Connected Common Subgraph (MCCS)
2. Connectivity Dominance => triangular inequality
3. MCCS Detection Algorithm (Index, Filtering & Validation, Verification Techniques)
Future work: Large graphs? New measures?
Thanks