Mining Frequent Subgraphs

COMP Seminar Spring 2007

Overview
Introduction: finding recurring subgraphs in graph databases.
gSpan
FFSM
[Figure: left, a social network; right, a protein structure (1L06)]
2/25/2019

Labeled Graph
We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, λ), where V is the set of vertices of G, E ⊆ V × V is the set of undirected edges of G, Σ_V (Σ_E) is the set of vertex (edge) labels, and λ is the labeling function, λ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.
[Figure: three example labeled graphs, P (vertices p1-p5), Q (q1-q3), and S (s1-s3), with vertex labels a, b, c, d and edge labels x, y]
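The five-element tuple can be mirrored by a small data structure. The sketch below is one possible Python encoding, not something taken from the slides: the class name LabeledGraph, its methods, and the toy graph are all hypothetical (the exact edges of graph S are not recoverable from the transcript, so the toy graph is invented for illustration).

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph; the two dicts together play the role of the labeling function."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> vertex label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> edge label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        # A frozenset key makes the edge undirected: (u, v) and (v, u) coincide.
        self.edge_labels[frozenset((u, v))] = label

    def label(self, *ids):
        """One id returns a vertex label, two ids return an edge label."""
        if len(ids) == 1:
            return self.vertex_labels[ids[0]]
        return self.edge_labels[frozenset(ids)]


# Hypothetical three-vertex graph (invented, not graph S from the slide).
toy = LabeledGraph()
for vertex, vertex_label in (("s1", "a"), ("s2", "b"), ("s3", "b")):
    toy.add_vertex(vertex, vertex_label)
toy.add_edge("s1", "s2", "y")
toy.add_edge("s2", "s3", "x")
```

Keying edges by an unordered pair is a design choice that keeps the structure consistent with the undirected definition above.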

Frequent Subgraph Mining
Input: a set GD of labeled undirected graphs and a support threshold σ (here σ = 2/3).
Output: all frequent subgraphs (w.r.t. σ) from GD.
[Figure: input graphs P, Q, and S, and the frequent subgraphs found among them]

Finding Frequent Subgraphs
Given a graph database GD = {G0,G1,…,Gn} and a support threshold σ, find all subgraphs that appear in at least a fraction σ of the graphs. Isomorphic subgraphs are considered the same subgraph.
Apriori approaches: generating subgraph candidates is complicated and expensive. Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
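To make the support threshold concrete, the sketch below counts support for the simplest patterns only, single labeled edges, where no isomorphism test is needed. The database GD_TOY and the helper names are hypothetical; the graphs are invented and are not P, Q, and S from the slides.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy database: each graph is a list of undirected edges
# written as (vertex_label, edge_label, vertex_label).
GD_TOY = {
    "G0": [("a", "y", "b"), ("b", "x", "b"), ("b", "y", "c")],
    "G1": [("b", "y", "a"), ("b", "x", "b")],
    "G2": [("a", "y", "b")],
}


def normalize(edge):
    """Order the endpoint labels so both directions of an edge compare equal."""
    u, e, v = edge
    return (min(u, v), e, max(u, v))


def frequent_edges(db, sigma):
    """Return {pattern: support} for one-edge patterns with support >= sigma."""
    counts = Counter()
    for edges in db.values():
        # The set ensures a pattern counts once per graph, however often it occurs.
        counts.update({normalize(e) for e in edges})
    supports = {p: Fraction(c, len(db)) for p, c in counts.items()}
    return {p: s for p, s in supports.items() if s >= sigma}


result = frequent_edges(GD_TOY, Fraction(2, 3))
```

For larger patterns the counting step needs a subgraph isomorphism test, which is exactly the cost the two algorithms in this talk try to control.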

gSpan
DFS without candidate generation:
    Relabels the graph representation to support DFS.
    Discovers all frequent subgraphs without candidate generation or pruning.
DFS representation:
    Map each graph to a DFS code (a sequence).
    Order the codes lexicographically.
    Construct a search tree based on the lexicographic order.

Depth-First Search Tree
(b), (c), and (d) are three depth-first search trees of the graph in (a). vi gives the visitation order. Dotted lines are edges that lead back to an already-visited node. The rightmost path is the path from v0 to vN.
[Figure: panels (a), (b), (c), (d)]

DFS Codes
Each edge is written as (start_index, end_index, start_label, edge_label, end_label).
Given e1 = (i1,j1) and e2 = (i2,j2), e1 < e2 if either:
    i1 = i2 and j1 < j2: both edges start at the same node and e1 ends at an earlier-visited node; or
    i1 < j1 and j1 = i2: e1 is a forward edge and e2 starts at the node where e1 ends.
code(G,T) is the sequence of edges ordered so that ei < ei+1, where G is the graph and T is a DFS tree of G.

DFS codes for trees (b), (c), and (d):
edge  (b)          (c)          (d)
0     (0,1,X,a,Y)  (0,1,Y,a,X)  (0,1,X,a,X)
1     (1,2,Y,b,X)  (1,2,X,a,X)  (1,2,X,a,Y)
2     (2,0,X,a,X)  (2,0,X,b,Y)  (2,0,Y,b,X)
3     (2,3,X,c,Z)  (2,3,X,c,Z)  (2,3,Y,b,Z)
4     (3,1,Z,b,Y)  (3,0,Z,b,Y)  (3,0,Z,c,X)
5     (1,4,Y,d,Z)  (0,4,Y,d,Z)  (2,4,Y,d,Z)
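A DFS code carries enough information to rebuild the graph it encodes. The sketch below decodes the three codes from the table and checks that they describe the same multiset of labeled edges, which is a necessary (not sufficient) condition for them to be codes of one graph. The function names decode and edge_signature are hypothetical, and the entry for edge 3 of code (c) is inferred from the other rows rather than read directly from the transcript.

```python
from collections import Counter

# DFS codes from the table: (start_index, end_index, start_label, edge_label, end_label).
CODES = {
    "b": [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")],
    "c": [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y"),
          (2, 3, "X", "c", "Z"), (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")],
    "d": [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X"),
          (2, 3, "Y", "b", "Z"), (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")],
}


def decode(code):
    """Return (vertex_labels, edges); raise if a vertex is given two different labels."""
    labels, edges = {}, []
    for i, j, li, le, lj in code:
        for vertex, vertex_label in ((i, li), (j, lj)):
            if labels.setdefault(vertex, vertex_label) != vertex_label:
                raise ValueError(f"vertex {vertex} has conflicting labels")
        edges.append((i, j, le))
    return labels, edges


def edge_signature(code):
    """Multiset of undirected labeled edges, independent of the visitation order."""
    return Counter((min(li, lj), le, max(li, lj)) for _, _, li, le, lj in code)


labels_b, edges_b = decode(CODES["b"])
signatures = [edge_signature(c) for c in CODES.values()]
```

Matching signatures do not prove isomorphism on their own; they only rule out codes that cannot belong to the same graph.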

DFS Lexicographic Order
α = code(G_α,T_α) = (a_0,a_1,…,a_m)
β = code(G_β,T_β) = (b_0,b_1,…,b_n)
α ≤ β iff (1) or (2):
(1) There is a t, 0 ≤ t ≤ min(m,n), such that a_k = b_k for all k < t and a_t <_e b_t. That is, the two codes share a common prefix and α has the smaller edge at the first position where they differ.
(2) a_k = b_k for 0 ≤ k ≤ m and n ≥ m. That is, β is at least as long as α and α is a prefix of β.
Minimum DFS code
The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G. Graphs A and B are isomorphic if and only if min(A) = min(B).
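The two conditions translate directly into a comparison routine. The sketch below is an illustration under assumptions: edge_lt uses the forward/backward edge order commonly given for gSpan, which goes beyond the two rules shown on the earlier slide, and labels are assumed to compare alphabetically as (start_label, edge_label, end_label). The function names are hypothetical. It picks the smallest of the three codes (b), (c), (d); it does not search all DFS trees, so it does not by itself compute min(G).

```python
CODES = {
    "b": [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")],
    "c": [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y"),
          (2, 3, "X", "c", "Z"), (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")],
    "d": [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X"),
          (2, 3, "Y", "b", "Z"), (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")],
}


def edge_lt(e1, e2):
    """Assumed edge order: positions first, then labels when positions are equal."""
    (i1, j1), (i2, j2) = e1[:2], e2[:2]
    if (i1, j1) == (i2, j2):
        return e1[2:] < e2[2:]          # same positions: compare label triples
    forward1, forward2 = i1 < j1, i2 < j2
    if forward1 and forward2:
        return j1 < j2 or (j1 == j2 and i1 > i2)
    if not forward1 and not forward2:
        return i1 < i2 or (i1 == i2 and j1 < j2)
    if forward1:                         # e1 forward, e2 backward
        return j1 <= i2
    return i1 < j2                       # e1 backward, e2 forward


def code_le(alpha, beta):
    """alpha <= beta in DFS lexicographic order."""
    for a, b in zip(alpha, beta):
        if a != b:
            return edge_lt(a, b)         # condition (1): first difference decides
    return len(alpha) <= len(beta)       # condition (2): alpha is a prefix of beta


def smallest(codes):
    """Name of the smallest code among the candidates given."""
    best = None
    for name, code in codes.items():
        if best is None or not code_le(codes[best], code):
            best = name
    return best


winner = smallest(CODES)
```

All three codes start with an edge at positions (0,1), so the label triples (X,a,Y), (Y,a,X), and (X,a,X) decide the order at the first edge.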

DFS Codes: Parents and Children
If α = (a_0,a_1,…,a_m) and β = (a_0,a_1,…,a_m,b), then β is the child of α and α is the parent of β.
A valid DFS code requires that b grows from a vertex on the rightmost path.
The rightmost path is the path from v0 to vN in the DFS tree.
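The rightmost path can be read off a DFS code by following forward edges back from the last-discovered vertex. The sketch below does that for code (b) from the earlier table. The function names are hypothetical, and the extension check encodes the usual gSpan reading as an assumption: a forward edge may start anywhere on the rightmost path and must introduce the next unused index, while a backward edge must start at the rightmost vertex. It does not check whether a backward edge already exists.

```python
CODE_B = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]


def rightmost_path(code):
    """Vertices from v0 to the last-discovered vertex, following forward edges."""
    parent = {}
    for i, j, *_ in code:
        if i < j:                      # forward edge: j is discovered from i
            parent[j] = i
    v = max(parent, default=0)         # last-discovered vertex vN
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]


def is_rightmost_extension(code, edge):
    """True if the new edge grows from the rightmost path (assumed gSpan rule)."""
    i, j = edge[:2]
    path = rightmost_path(code)
    if i < j:                          # forward edge introducing a new vertex
        return i in path and j == path[-1] + 1
    return i == path[-1] and j in path[:-1]   # backward edge from the rightmost vertex


path_b = rightmost_path(CODE_B)
```

For code (b) the forward edges are (0,1), (1,2), (2,3), and (1,4), so the rightmost path runs through vertices 0, 1, and 4.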

DFS Code Trees
Organize DFS code nodes as parent-child.
Pre-order traversal follows DFS lexicographic order.
If s and s’ are the same graph with different DFS codes, s’ is not the minimum and can be pruned.

gSpan
D is the set of all graphs. S is the result set.

Algorithm 1: GraphSet_Projection(D,S)
    sort labels in D by frequency
    remove infrequent vertices and edges
    relabel remaining vertices and edges
    S’ = all frequent 1-edge graphs in D
    sort S’ in DFS lexicographic order
    S = S’
    foreach edge e in S’ do
        s = graph defined by e
        s.D = graphs in D containing e
        Subgraph_Mining(D,S,s)
        D = D - e
        if |D| < minSup
            break

Subprocedure 1: Subgraph_Mining(D,S,s)
    if s != min(s)
        return
    S = S ∪ {s}
    s’ = +1-edge children of s in s.D
    foreach child c of s’ do
        if support(c) ≥ minSup
            Subgraph_Mining(Ds,S,c)

Vertices = {A,B,C,…}. Edges = {a,b,c,…}.
Subgraph_Mining grows all nodes in the subtree rooted at s.
The first foreach iteration finds all frequent subgraphs containing (A,a,A); the next finds all frequent subgraphs containing (B,b,C) but not (A,a,A), and so on.
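The first three steps of GraphSet_Projection (sort labels by frequency, remove infrequent vertices and edges, relabel) can be sketched in isolation. The code below is an illustration under assumptions: graphs are hypothetical (vertex-label dict, edge-label dict) pairs, frequency means the number of graphs containing a label, and new labels are integer ranks with the most frequent label first. The function name preprocess and the toy database are invented for this example.

```python
from collections import Counter


def preprocess(db, min_sup):
    """Drop infrequent labels and relabel the rest by descending frequency."""
    v_count, e_count = Counter(), Counter()
    for vertices, edges in db:
        v_count.update(set(vertices.values()))   # count each label once per graph
        e_count.update(set(edges.values()))

    def ranks(count):
        frequent = [label for label, c in count.items() if c >= min_sup]
        frequent.sort(key=lambda label: (-count[label], label))  # ties broken by label
        return {label: rank for rank, label in enumerate(frequent)}

    v_rank, e_rank = ranks(v_count), ranks(e_count)
    out = []
    for vertices, edges in db:
        kept_v = {v: v_rank[l] for v, l in vertices.items() if l in v_rank}
        # An edge survives only if its label and both endpoints are frequent.
        kept_e = {(u, w): e_rank[l] for (u, w), l in edges.items()
                  if l in e_rank and u in kept_v and w in kept_v}
        out.append((kept_v, kept_e))
    return out, v_rank, e_rank


# Hypothetical database: uppercase vertex labels, lowercase edge labels.
DB = [
    ({1: "A", 2: "B", 3: "C"}, {(1, 2): "a", (2, 3): "b"}),
    ({1: "A", 2: "B"}, {(1, 2): "a"}),
    ({1: "B", 2: "D"}, {(1, 2): "a"}),
]
cleaned, v_rank, e_rank = preprocess(DB, min_sup=2)
```

Shrinking the database this way before any pattern growth means later steps never touch labels that cannot appear in a frequent subgraph.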

Runtime: Synthetic
[Figure: runtime (sec) on synthetic data]

Runtime: Chemical
[Figure: runtime (sec, axis ticks 1 to 1000) versus support threshold (%) for Apriori (FSG) and gSpan]
Dataset: 340 chemical compounds, 24 different atoms, 66 atom types, 4 bond types.
Sparse: on average 27 vertices and 28 edges per graph. The largest graph has 214 vertices and 214 edges.

gSpan Advantages
Lower memory requirements: under 100 MB on the chemical dataset; FSG ran out of memory for support < 5%.
Faster than naïve FSG by an order of magnitude: 6-30 times on synthetic data; [value missing] times on chemical data.
FSG (Apriori) takes 10 minutes to process a dataset with 6.5% minimum support; gSpan completes in 10 seconds.
No candidate generation.
Lexicographic ordering minimizes the search tree.
False-positive pruning.
Any disadvantages?

FFSM: Fast Frequent Subgraph Mining -- An Overview
How to solve the graph isomorphism problem? A novel graph canonical form: CAM.
How to tackle the subgraph isomorphism problem (NP-complete)? Incrementally maintained embeddings.
How to enumerate subgraphs? An efficient data structure, the CAM tree, with two operations: CAM-join and CAM-extension.
For comparison: FSG uses level-wise search; gSpan uses depth-first search.

Adjacency Matrix
Every diagonal entry of the adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of that vertex. Every off-diagonal entry in the lower triangle of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge. For an undirected graph, the upper triangle is always a mirror of the lower triangle.

Three adjacency matrices of graph P (lower triangles only):
M1             M2             M3
a              a              b
y b            y b            x b
y x b          y x b          y 0 d
0 y 0 c        0 0 y d        0 y 0 c
0 0 y 0 d      0 y 0 0 c      y y 0 0 a

Code
The code of an n × n adjacency matrix M is the sequence of its lower-triangular entries (including the diagonal entries) in the order M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n.
For three adjacency matrices M1, M2, and M3 of graph P:
Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a
These inequalities hold when a ranks above b and every label ranks above 0.
The Canonical Adjacency Matrix (CAM) is the one that produces the maximal code under lexicographic order.
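For a graph as small as P, the canonical form can be found by brute force: build the code for every vertex ordering and keep the maximum. The sketch below does this. The graph is read off Code(M1); the symbol order a > b > c > d > x > y > 0 is an assumption chosen to be consistent with the inequalities on the slide, and the names ORDER, RANK, code, and key are hypothetical. Enumerating all n! orderings is only workable for tiny graphs and is not how FFSM itself proceeds.

```python
from itertools import permutations

ORDER = "abcdxy0"   # assumed symbol order, largest first
RANK = {s: len(ORDER) - k for k, s in enumerate(ORDER)}

# Graph P as encoded by Code(M1): vertex labels by index, undirected labeled edges.
LABELS = ["a", "b", "b", "c", "d"]
EDGES = {frozenset((0, 1)): "y", frozenset((0, 2)): "y", frozenset((1, 2)): "x",
         frozenset((1, 3)): "y", frozenset((2, 4)): "y"}


def code(order):
    """Code of the adjacency matrix whose k-th row and column is vertex order[k]."""
    out = []
    for r, u in enumerate(order):
        for c in range(r):                       # off-diagonal entries of row r
            out.append(EDGES.get(frozenset((u, order[c])), "0"))
        out.append(LABELS[u])                    # diagonal entry: vertex label
    return "".join(out)


def key(code_str):
    """Map a code to a list of ranks so Python list comparison follows ORDER."""
    return [RANK[s] for s in code_str]


code_m1 = code((0, 1, 2, 3, 4))
code_m2 = code((0, 1, 2, 4, 3))
code_m3 = code((2, 1, 4, 3, 0))
cam_code = max((code(p) for p in permutations(range(5))), key=key)
```

Under the assumed symbol order the maximum over all 120 orderings coincides with Code(M1), so M1 would be the CAM of P in this setting.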

MP Submatrix
For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last nonzero entry from A.
A CAM is connected iff the corresponding graph is connected.
Theorem I: A CAM’s MP submatrix is a CAM.
Theorem II: A connected CAM’s MP submatrix is connected.
[Figure: matrices M1 through M6 illustrating MP submatrices]
Speaker note: also explain the symmetric property; the row is also removed if no edge entry is left in it.
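One reading of the definition, combined with the note about dropping a row that has no edge entry left, can be sketched as follows. This is an interpretation rather than the authors' procedure: "last nonzero entry" is taken to mean the last nonzero off-diagonal (edge) entry of the last row. The function names and the list-of-rows matrix layout are hypothetical. The starting matrix corresponds to Code(M1), aybyxb0y0c00y0d.

```python
def mp_submatrix(rows):
    """rows[k] holds the k+1 lower-triangular entries of row k, diagonal last."""
    if len(rows) < 2:
        raise ValueError("a single-vertex matrix has no MP submatrix")
    *head, last = [list(r) for r in rows]       # copy so the input is not modified
    edge_cols = [c for c, s in enumerate(last[:-1]) if s != "0"]
    if not edge_cols:
        raise ValueError("last row has no edge entry")
    last[edge_cols[-1]] = "0"                   # remove the last edge entry
    if len(edge_cols) == 1:                     # no edge entry left: drop the row
        return head
    return head + [last]


def flatten(rows):
    """Code of a matrix stored as lower-triangular rows."""
    return "".join(s for row in rows for s in row)


M1 = [["a"],
      ["y", "b"],
      ["y", "x", "b"],
      ["0", "y", "0", "c"],
      ["0", "0", "y", "0", "d"]]

chain = [flatten(M1)]
current = M1
while len(current) > 1:
    current = mp_submatrix(current)
    chain.append(flatten(current))
```

Each step removes exactly one edge, so the chain walks from the five-edge matrix down to the single vertex a.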

CAM Tree: Subgraphs
[Figure: the CAM tree of graph P (vertices p1-p5), with one CAM per connected subgraph]
The first blinking object can be obtained by “superimposing” the two objects above it. The second one cannot. This is the motivation for suboptimal CAMs.

CAM Tree: Frequent Subgraphs
[Figure: CAM tree of the frequent subgraphs of graphs P, Q, and S for σ = 2/3]

How to Enumerate Nodes in a CAM Tree?
Two operations to explore the CAM tree:
    CAM-join
    CAM-extension
Augmenting the CAM tree with suboptimal CAMs.
Objectives:
    no false dismissals
    no redundancy
Plus: we want to do this efficiently!

Suboptimal Tree
A suboptimal CAM is a matrix whose MP submatrix is a CAM.
[Figure: the CAM tree of graph P augmented with suboptimal CAMs, with steps marked j and e. The largest matrix, which has five edges, is not shown due to space limitations.]
Speaker note: may explain depth-first search and using information from siblings.

Summary
Theorem: For a graph G, let C_{K-1} (C_K) be the set of the suboptimal CAMs of all the size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of C_K can be enumerated unambiguously either by joining two members of C_{K-1} or by extending a member of C_{K-1}.

Experimental Study
Predictive Toxicology Evaluation Competition (PTE)
    Contains 337 compounds.
    Each graph contains 27 nodes and 27 edges on average.
NIH DTP Anti-Viral Screen Test (DTP CA/CM)
    Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI).
    We formed a dataset containing CA (423) and CM (1083).
    Each graph contains 25 nodes and 27 edges on average.

Performance (PTE)
[Figure: two plots against support threshold (%)]

Performance (DTP CACM)
[Figure: two plots against support threshold (%)]
