An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. ICDM, 2001. Presenter: 蔡明瑾
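Introduction: Structural patterns occur in biology and chemistry, e.g. chemical compounds. Data is modeled as a graph: vertex = item, edge = relation between items. The graphs are undirected, connected, and labeled. [Figure: example graph with vertex labels b, a, a and edge labels x, y, x]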
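Graph Isomorphism: G1(V1,E1) and G2(V2,E2) are isomorphic if they are topologically identical to each other: there is a bijective mapping from V1 to V2 such that each edge in E1 is mapped to a single edge in E2 and vice versa, with vertex and edge labels preserved. [Figure: two drawings of the same labeled triangle (vertex labels b, a, a; edge labels x, x, y) under different vertex orderings v0, v1, v2]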
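The definition above can be checked directly on small graphs by trying every vertex mapping. The following is a minimal sketch, not the method from the paper; the function name `is_isomorphic` and the graph encoding (a list of vertex labels plus a dict from unordered vertex pairs to edge labels) are assumptions made for illustration.

```python
from itertools import permutations

def is_isomorphic(labels1, edges1, labels2, edges2):
    """Brute-force labeled-graph isomorphism test (exponential; small graphs only).

    labels*: list of vertex labels, index = vertex id.
    edges*:  dict mapping frozenset({u, v}) -> edge label (undirected, no self-loops).
    """
    n = len(labels1)
    if n != len(labels2) or len(edges1) != len(edges2):
        return False
    for perm in permutations(range(n)):           # perm[u] = image of u in G2
        if any(labels1[u] != labels2[perm[u]] for u in range(n)):
            continue                               # vertex labels must be preserved
        mapped = {frozenset(perm[u] for u in e): lab for e, lab in edges1.items()}
        if mapped == edges2:                       # same edges with the same labels
            return True
    return False

# The labeled triangle from the slide, written under two vertex orderings.
g1_labels = ["b", "a", "a"]
g1_edges = {frozenset({0, 1}): "x", frozenset({0, 2}): "x", frozenset({1, 2}): "y"}
g2_labels = ["a", "b", "a"]
g2_edges = {frozenset({0, 1}): "x", frozenset({0, 2}): "y", frozenset({1, 2}): "x"}

print(is_isomorphic(g1_labels, g1_edges, g2_labels, g2_edges))
```

Trying all |V|! mappings is only practical for very small graphs, which is the motivation for canonical labels and vertex invariants.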
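Canonical labeling: Adjacency matrix. The code of a vertex ordering is the vertex labels followed by the upper-triangular entries of the adjacency matrix, read column by column.

Ordering 1:
        v0  v1  v2
v0 b    -   x   x
v1 a    x   -   y
v2 a    x   y   -
code = baaxxy

Ordering 2:
        v0  v1  v2
v0 a    -   x   y
v1 b    x   -   x
v2 a    y   x   -
code = abaxyx

Both orderings describe the same graph, yet the codes differ.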
Canonical labeling: Different permutations of the vertices lead to different codes, and there are |V|! permutations to consider. The canonical label is the largest code.
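As a rough illustration of taking the largest code over all |V|! orderings, here is a brute-force sketch. The names `code` and `canonical_label`, the graph encoding, and the use of '0' for a missing edge are assumptions for this example, not notation from the slides.

```python
from itertools import permutations

def code(labels, edges, order):
    """Code for one vertex ordering: the vertex labels, then the upper triangle
    of the adjacency matrix read column by column ('0' marks a missing edge)."""
    n = len(order)
    vertex_part = "".join(labels[v] for v in order)
    edge_part = "".join(
        edges.get(frozenset({order[i], order[j]}), "0")
        for j in range(1, n) for i in range(j)
    )
    return vertex_part + edge_part

def canonical_label(labels, edges):
    """Largest code over all |V|! vertex orderings (brute force)."""
    return max(code(labels, edges, p) for p in permutations(range(len(labels))))

# The labeled triangle from the slides: v0 = b, v1 = a, v2 = a.
labels = ["b", "a", "a"]
edges = {frozenset({0, 1}): "x", frozenset({0, 2}): "x", frozenset({1, 2}): "y"}
print(code(labels, edges, (0, 1, 2)))   # code of the identity ordering
print(canonical_label(labels, edges))   # largest code over all orderings
```

Two isomorphic graphs get the same canonical label, so comparing labels replaces a pairwise isomorphism test.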
Vertex invariants: properties of a vertex that do not change across isomorphism mappings, e.g. vertex degree, vertex label, siblings. [Figure: example graph with vertex labels b, a, a and edge labels x, x, y]
Vertex Degrees and Labels: Partition the vertices of the adjacency matrix by degree and label, so that every partition contains vertices with the same degree and the same label.
Vertex Degrees and Labels: Partitioning by degree alone: p0 = {v0, v1, v2}: 2. Partitioning by degree and label: p0 = {v1, v2}: (2, a), p1 = {v0}: (2, b).

        v0  v1  v2
v0 b    -   x   x
v1 a    x   -   y
v2 a    x   y   -
code = baaxxy
Vertex Degrees and Labels: Order the vertices partition by partition, with p0 = {v1, v2}: (2, a) and p1 = {v0}: (2, b).

        v1  v2  v0
v1 a    -   y   x
v2 a    y   -   x
v0 b    x   x   -
code = aabyxx

Before: 3! orderings. Now: 2! × 1! orderings.
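The reduction from 3! to 2! × 1! orderings can be sketched as follows. This is an illustrative sketch under assumptions: the helper names (`partition_vertices`, `orderings`, `canonical_label`) are hypothetical, and sorting the partitions by their (degree, label) key is one possible fixed order, chosen here because it reproduces the slide's example.

```python
from itertools import permutations, product
from collections import defaultdict

def partition_vertices(labels, edges):
    """Group vertices by the invariant (degree, label). Partitions are sorted
    by that key, so isomorphic graphs get their partitions in the same order."""
    degree = defaultdict(int)
    for e in edges:
        for v in e:
            degree[v] += 1
    groups = defaultdict(list)
    for v, lab in enumerate(labels):
        groups[(degree[v], lab)].append(v)
    return [groups[k] for k in sorted(groups)]

def orderings(partitions):
    """Yield vertex orderings that permute vertices only inside each partition."""
    for combo in product(*(permutations(p) for p in partitions)):
        yield tuple(v for part in combo for v in part)

def code(labels, edges, order):
    """Vertex labels, then the upper triangle of the adjacency matrix read
    column by column ('0' marks a missing edge)."""
    n = len(order)
    return "".join(labels[v] for v in order) + "".join(
        edges.get(frozenset({order[i], order[j]}), "0")
        for j in range(1, n) for i in range(j)
    )

def canonical_label(labels, edges):
    """Largest code among the partition-respecting orderings only."""
    parts = partition_vertices(labels, edges)
    return max(code(labels, edges, o) for o in orderings(parts))

labels = ["b", "a", "a"]
edges = {frozenset({0, 1}): "x", frozenset({0, 2}): "x", frozenset({1, 2}): "y"}
print(partition_vertices(labels, edges))
print(canonical_label(labels, edges))
```

Because the search is restricted to orderings that respect the partition order, the result is the largest code among those orderings, which can differ from the largest code over all |V|! orderings. It is still canonical as long as every graph is partitioned and ordered by the same rule.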
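Running example (minsup = 2): transactions g0, g1, g2. Frequent 1-subgraphs: the single-edge subgraphs have TID lists {0,1,2}, {0,2}, {0,1}, {2}. [Figure: graphs g0, g1, g2 and a table of TID lists and canonical labels (cl)]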
Running example (minsup = 2)

Frequent 1-subgraphs:
tid        cl     child
{0,1,2}    010    c0,c1,c2,c3
{0,2}      021    c2
{0,1}             c3

Candidate 2-subgraphs and their possible tid:
c0    {0,1,2}
c1    {0,1,2}
c2    {0,2}
c3    {0,1}
……
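Frequent 2-subgraphs

Frequent 1-subgraphs and their children:
tid        cl     child
{0,1,2}    010    c1,c2,c3
{0,2}      021    c2
{0,1}             c3,c4

Frequent 2-subgraphs:
tid        cl
{0,2}      01201x
{0,1,2}    10000x
{0,1}      10203x
{0,1}      21133x

[Figure: drawings of the frequent 2-subgraphs]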
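Frequency counting with TID lists: Intersect the TID lists of two frequent k-subgraphs to bound where the (k+1)-candidate can occur. If the intersection still meets minsup, compute the actual support on those transactions; otherwise the candidate is pruned.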
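The TID-list pruning step can be sketched in a few lines. The function names are hypothetical, and `minsup` is assumed to be an absolute transaction count, as in the running example.

```python
def possible_tids(parent_tid_lists):
    """Intersect the TID lists of the frequent k-subgraphs of a candidate.
    The candidate can only occur in transactions that contain all of them."""
    return set.intersection(*map(set, parent_tid_lists))

def prune_by_tid(parent_tid_lists, minsup):
    """Return the transactions worth checking, or None if the candidate
    cannot reach minsup and is pruned without any isomorphism test."""
    tids = possible_tids(parent_tid_lists)
    return tids if len(tids) >= minsup else None

# TID lists taken from the running example, minsup = 2.
print(prune_by_tid([{0, 1, 2}, {0, 2}], minsup=2))   # survives: check these transactions
print(prune_by_tid([{0, 2}, {0, 1}], minsup=2))      # pruned
```

A surviving candidate still needs its actual support counted, but only inside the returned transactions.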
Candidate generation: Join two frequent k-subgraphs that have the same (k-1)-subgraph as a core to form a (k+1)-candidate subgraph. One join can produce more than one candidate because of: vertex labeling, multiple cores, multiple automorphisms.
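The core idea of the join can be illustrated with a deliberately simplified sketch. It assumes every vertex has a fixed global identifier, so a subgraph is just a set of edges and the shared core is a plain set intersection. This sidesteps exactly the three complications listed above (vertex labeling, multiple cores, multiple automorphisms), so it is not the join from the paper; all names are hypothetical.

```python
from itertools import combinations

def is_connected(edges):
    """True if the edges (u, v, label) form one connected component."""
    edges = list(edges)
    seen = set(edges[0][:2])
    changed = True
    while changed:
        changed = False
        for u, v, _ in edges:
            if (u in seen) != (v in seen):   # edge crosses the frontier
                seen |= {u, v}
                changed = True
    return all(u in seen and v in seen for u, v, _ in edges)

def join(a, b):
    """Join two k-edge subgraphs sharing a (k-1)-edge core.
    Returns the (k+1)-edge union, or None if no such core is shared."""
    if len(a) != len(b) or len(a & b) != len(a) - 1:
        return None
    union = a | b
    return union if is_connected(union) else None

def generate_candidates(frequent_k):
    """All distinct (k+1)-candidates obtained from pairs of frequent k-subgraphs."""
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        c = join(a, b)
        if c is not None:
            candidates.add(c)
    return candidates

# Edges of the labeled triangle, written as (u, v, edge_label) with u < v.
e01, e02, e12 = (0, 1, "x"), (0, 2, "x"), (1, 2, "y")
frequent_2 = [frozenset({e01, e02}), frozenset({e01, e12})]
print(generate_candidates(frequent_2))
```

Without fixed vertex identifiers, finding the common core requires comparing canonical labels of (k-1)-subgraphs, and a single pair can then yield several distinct candidates.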
Vertex labeling
Multiple automorphisms
Multiple cores
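Frequent 1-subgraphs and their children:
tid        cl     child
{0,1,2}    010    c1,c2,c3
{0,2}      021    c2
{0,1}             c3,c4

Frequent 2-subgraphs:
tid        cl
{0,2}      01201x
{0,1,2}    10000x
{0,1}      10203x
{0,1}      21133x

Candidate 3-subgraphs (q0, q1, ...) with possible tid: {0,2}, {0,}, {0,2}. [Figure annotation on one candidate: does not satisfy downward closure]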
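A downward-closure check of this kind can be sketched as follows: a (k+1)-candidate is kept only if every connected k-subgraph obtained by deleting one edge is itself frequent. The sketch assumes the frequent k-subgraphs are available as a set of canonical labels computed by the same brute-force routine; all names and the graph encoding are hypothetical.

```python
from itertools import permutations

def canonical_label(labels, edges):
    """Largest code over all vertex orderings ('0' marks a missing edge)."""
    n = len(labels)
    def code(order):
        return "".join(labels[v] for v in order) + "".join(
            edges.get(frozenset({order[i], order[j]}), "0")
            for j in range(1, n) for i in range(j))
    return max(code(p) for p in permutations(range(n)))

def remove_edge(labels, edges, e):
    """Drop edge e, discard vertices left isolated, renumber the rest from 0."""
    rest = {k: lab for k, lab in edges.items() if k != e}
    kept = sorted({v for k in rest for v in k})
    new_id = {v: i for i, v in enumerate(kept)}
    return ([labels[v] for v in kept],
            {frozenset(new_id[v] for v in k): lab for k, lab in rest.items()})

def is_connected(n, edges):
    """Depth-first search from vertex 0 over undirected edges without self-loops."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for e in edges:
            if u in e:
                (w,) = e - {u}
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return len(seen) == n

def satisfies_downward_closure(labels, edges, frequent_labels):
    """Candidate must have at least two edges. Every connected subgraph with
    one edge fewer has to appear in frequent_labels."""
    for e in edges:
        sub_labels, sub_edges = remove_edge(labels, edges, e)
        if (is_connected(len(sub_labels), sub_edges)
                and canonical_label(sub_labels, sub_edges) not in frequent_labels):
            return False
    return True
```

Candidates that fail this test are discarded before any TID-list intersection or support counting.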
Experiment: AMD 1.53 GHz, 2 GB main memory, Linux OS. Chemical compound datasets: PTE (340 compounds), 66 atom types and four bond types, 27 edges/graph on average; DTP (223,644 compounds), 104 atom types and three bond types, 22 edges/graph on average. Synthetic datasets.
PTE and DTP
Synthetic datasets
Synthetic datasets: |D| = 10000, |S| = 200, |L_E| = 1, minsup = 2%