Mining Frequent Subgraphs

Presentation transcript:

Mining Frequent Subgraphs COMP 790-90 Seminar Spring 2007

Overview
Introduction: finding recurring subgraphs from graph databases.
gSpan
FFSM
(Figure: left, a social network; right, a protein structure, labeled 1L06.) 2/25/2019

Labeled Graph
We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, δ), where V is the set of vertices of G, E ⊆ V × V is the set of undirected edges of G, Σ_V (Σ_E) is the set of vertex (edge) labels, and δ is the labeling function, δ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.
(Figure: three example labeled graphs: P with vertices p1 to p5, Q with vertices q1 to q3, and S with vertices s1 to s3; vertex labels a, b, c, d and edge labels x, y.)
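A minimal Python sketch of the five-element tuple may make the definition concrete. The class name `LabeledGraph`, its methods, and the example graph are illustrative assumptions, not part of the slides; the two label sets are implicit in the values of the two dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph; names here are illustrative only."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("self-loops are not modeled in this sketch")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        # A frozenset key makes (u, v) and (v, u) the same undirected edge.
        self.edge_labels[frozenset((u, v))] = label

    def label(self, *item):
        """Labeling function: one argument is a vertex, two are an edge."""
        if len(item) == 1:
            return self.vertex_labels[item[0]]
        return self.edge_labels[frozenset(item)]


# A made-up three-vertex example (not the exact graph from the slide).
g = LabeledGraph()
g.add_vertex("s1", "a")
g.add_vertex("s2", "b")
g.add_vertex("s3", "b")
g.add_edge("s1", "s2", "y")
g.add_edge("s2", "s3", "x")
print(g.label("s1"), g.label("s2", "s1"))
```

Keying edges by a frozenset is one simple way to get undirected edges; an adjacency-list layout would serve equally well.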

Frequent Subgraph Mining
Input: a set GD of labeled undirected graphs and a support threshold σ (here σ = 2/3).
Output: all frequent subgraphs (w.r.t. σ) from GD.
(Figure: the input graphs P, Q, and S, and the frequent subgraphs found among them.)

Finding Frequent Subgraphs
Given a graph database GD = {G0, G1, …, Gn}, find all subgraphs that appear in at least σ of the graphs, where σ is the support threshold. Isomorphic subgraphs are considered the same subgraph.
Apriori approaches: generating subgraph candidates is complicated and expensive, and because subgraph isomorphism is an NP-complete problem, pruning is expensive too.
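To see why the subgraph test dominates the cost, here is a deliberately naive sketch that counts support by trying every injective vertex mapping. The function names `contains` and `support`, the graph encoding (a pair of dictionaries), and the toy database are assumptions made for illustration; real miners avoid this brute-force search.

```python
from itertools import permutations


def contains(graph, pattern):
    """Brute-force test: does `pattern` occur as a subgraph of `graph`?

    Each graph is a pair (vertex_labels, edge_labels), where edge_labels
    maps frozenset({u, v}) to a label.
    """
    g_vl, g_el = graph
    p_vl, p_el = pattern
    p_vertices = list(p_vl)
    # Try every injective mapping of pattern vertices to graph vertices.
    for image in permutations(g_vl, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        if any(p_vl[v] != g_vl[m[v]] for v in p_vertices):
            continue
        if all(g_el.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_el.items()):
            return True
    return False


def support(database, pattern):
    """Fraction of database graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)


# Toy database of three graphs (made up for this example).
g1 = ({1: "a", 2: "b", 3: "b"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "y"})
g2 = ({1: "a", 2: "b"}, {frozenset((1, 2)): "x"})
g3 = ({1: "a", 2: "b"}, {frozenset((1, 2)): "y"})
database = [g1, g2, g3]
pattern = ({"u": "a", "v": "b"}, {frozenset(("u", "v")): "x"})
print(support(database, pattern))
```

The pattern a-x-b occurs in two of the three toy graphs, so it is frequent at a threshold of 2/3. The number of mappings tried grows factorially with pattern size, which is the expense the slide refers to.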

gSpan
DFS without candidate generation: relabels the graph representation to support DFS, and discovers all frequent subgraphs without candidate generation or pruning.
DFS representation: map each graph to a DFS code (a sequence), order the codes lexicographically, and construct a search tree based on the lexicographic order.

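Depth-First Search Tree
(b), (c), and (d) are three depth-first search trees of the graph in figure (a). vi gives the visitation order. Dotted lines are edges back to an already visited node. The rightmost path is the path from v0 to vN, the last vertex visited.
(Figure panels: (a), (b), (c), (d).)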

DFS Codes
Each edge is written as (start_index, end_index, start_label, edge_label, end_label).
Given e1 = (i1, j1) and e2 = (i2, j2), e1 < e2 if:
(i) i1 = i2 and j1 < j2: both edges start at the same node and e1 ends at an earlier visited node; or
(ii) i1 < j1 and j1 = i2: e1 is a forward edge and ends at the node where e2 starts.
code(G, T) is the sequence of edges ordered so that ei < ei+1, where G is a graph and T is a DFS tree of G.

DFS codes of the trees in figures (b), (c), and (d):
edge  (b)            (c)            (d)
0     (0,1,X,a,Y)    (0,1,Y,a,X)    (0,1,X,a,X)
1     (1,2,Y,b,X)    (1,2,X,a,X)    (1,2,X,a,Y)
2     (2,0,X,a,X)    (2,0,X,b,Y)    (2,0,Y,b,X)
3     (2,3,X,c,Z)    (2,3,X,c,Z)    (2,3,Y,b,Z)
4     (3,1,Z,b,Y)    (3,0,Z,b,Y)    (3,0,Z,c,X)
5     (1,4,Y,d,Z)    (0,4,Y,d,Z)    (2,4,Y,d,Z)
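The two rules on the slide do not order every pair of edges directly; for example, (3,1) and (1,4) in column (b) are related only through further rules. The sketch below assumes the usual forward/backward completion of the order described in the gSpan literature: a forward edge (i < j) is positioned by the vertex it discovers, and a backward edge (i > j) comes right after the forward edge that discovered its start vertex. The name `edge_key` is hypothetical.

```python
def edge_key(edge):
    """Sort key that places the edges of one DFS code in code order.

    Assumed completion of the slide's two rules (not stated on the slide):
    forward edges are ordered by the vertex they discover; backward edges
    from vertex i follow the forward edge that discovered i, ordered by
    their end index.
    """
    i, j = edge[0], edge[1]
    if i < j:              # forward edge: discovers vertex j
        return (j, 0, 0)
    return (i, 1, j)       # backward edge: leaves vertex i


# Column (b) of the DFS code table.
code_b = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (2, 3, "X", "c", "Z"),
    (3, 1, "Z", "b", "Y"),
    (1, 4, "Y", "d", "Z"),
]

# Sorting the edges in any initial arrangement recovers the code order.
shuffled = list(reversed(code_b))
recovered = sorted(shuffled, key=edge_key)
print(recovered == code_b)
```

Under this key, rule (i) and rule (ii) from the slide both hold as special cases, which can be checked pair by pair on the table.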

DFS Lexicographic Order
Let α = code(Gα, Tα) = (a0, a1, …, am) and β = code(Gβ, Tβ) = (b0, b1, …, bn). Then α ≤ β iff (1) or (2) holds:
(1) there is a t, 0 ≤ t ≤ min(m, n), such that ak = bk for all k < t and at <e bt; that is, the codes share a prefix and α then has the smaller edge;
(2) ak = bk for all 0 ≤ k ≤ m, and n ≥ m; that is, β is at least as long as α and their common prefix elements are equal.
Minimum DFS code: the minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G. Graphs A and B are isomorphic if and only if min(A) = min(B).
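The comparison can be sketched in Python by turning each edge into a sortable key and letting list comparison handle the shared-prefix and shorter-code cases. The slide does not spell out the per-edge order <e, so the key below is an assumption based on the gSpan literature: backward edges precede forward edges, a forward edge from a deeper vertex comes first, and ties are broken by the label triple. The function names are hypothetical.

```python
def lex_key(edge):
    """Per-edge key for comparing edges at the same position of two codes.

    Assumed per-edge order (not given on the slide): position first,
    then (start_label, edge_label, end_label).
    """
    i, j, li, le, lj = edge
    if i < j:                          # forward edge; deeper start vertex first
        return (j, 0, -i, li, le, lj)
    return (i, 1, j, li, le, lj)       # backward edge


def dfs_code_le(code_a, code_b):
    """True if code_a <= code_b in DFS lexicographic order."""
    # Python compares lists lexicographically and treats a proper prefix
    # as smaller, which matches conditions (1) and (2).
    return [lex_key(e) for e in code_a] <= [lex_key(e) for e in code_b]


code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
code_c = [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y"),
          (2, 3, "X", "c", "Z"), (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")]
code_d = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X"),
          (2, 3, "Y", "b", "Z"), (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")]

smallest = min([code_b, code_c, code_d], key=lambda c: [lex_key(e) for e in c])
print(smallest is code_d)
```

Among the three codes from the DFS Codes table, the first edges already differ only in their labels, so (d) is the smallest of the three, followed by (b) and then (c). This does not by itself show that (d) is the minimum DFS code of the graph, since other DFS trees exist.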

DFS Codes: Parents and Children
If α = (a0, a1, …, am) and β = (a0, a1, …, am, b), then β is a child of α and α is the parent of β.
For β to be a valid DFS code, the new edge b must grow from a vertex on the rightmost path, which is the path from v0 to vN in the DFS tree.
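As a small illustration, the rightmost path can be read off a DFS code by following forward edges back from the last discovered vertex. The function name `rightmost_path` is hypothetical, and the sketch assumes vertices are numbered in discovery order, as in the DFS code table.

```python
def rightmost_path(code):
    """Return the rightmost path [v0, ..., vN] of a DFS code.

    Assumes each edge is (i, j, ...) with vertices numbered in discovery
    order, so forward edges have i < j.
    """
    parent = {}
    for i, j, *_ in code:
        if i < j:                      # forward edge: i is the DFS parent of j
            parent[j] = i
    v = max(parent, default=0)         # the last vertex discovered
    path = [v]
    while path[-1] in parent:          # walk up to the root v0
        path.append(parent[path[-1]])
    return path[::-1]


# Column (b) of the DFS code table.
code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
print(rightmost_path(code_b))
```

For column (b) the last forward edge is (1,4), so the rightmost path runs through vertices 0, 1, and 4; a child code may only add an edge that starts at one of these vertices.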

DFS Code Trees
Organize DFS codes as tree nodes linked parent to child. A pre-order traversal of the tree follows DFS lexicographic order. If s and s’ encode the same graph with different DFS codes and s’ is not the minimum code, s’ can be pruned.

gSpan
D is the set of all graphs. S is the result set. Vertex labels = {A, B, C, …}. Edge labels = {a, b, c, …}.

Algorithm 1: GraphSet_Projection(D, S)
    sort labels in D by frequency
    remove infrequent vertices and edges
    relabel remaining vertices and edges
    S’ = all frequent 1-edge graphs in D
    sort S’ in DFS lexicographic order
    S = S’
    foreach edge e in S’ do
        s = graph defined by e
        s.D = graphs in D containing e
        Subgraph_Mining(D, S, s)
        D = D - e
        if |D| < minSup
            break

Subprocedure 1: Subgraph_Mining(D, S, s)
    if s != min(s)
        return
    S = S ∪ {s}
    s’ = +1-edge children of s in s.D
    foreach child c of s’ do
        if support(c) ≥ minSup
            Subgraph_Mining(Ds, S, c)

Subgraph_Mining grows all nodes in the subtree rooted at s. Each iteration of the foreach loop in Algorithm 1 finds all frequent subgraphs containing the current edge: first those containing (A,a,A), then those containing (B,b,C) but not (A,a,A), because the earlier edge has already been removed from D.
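The first steps of GraphSet_Projection, finding the frequent 1-edge graphs and sorting them, can be sketched as follows. This is only an illustration under assumed names and an assumed graph encoding (a pair of dictionaries); it counts each label triple once per graph and does not perform the relabeling or the recursive mining.

```python
from collections import Counter


def frequent_edges(database, min_sup):
    """Return the label triples of 1-edge graphs found in >= min_sup graphs.

    Each graph is (vertex_labels, edges) with edges mapping (u, v) to an
    edge label. Endpoint labels are sorted so that an undirected edge has
    a single representation.
    """
    counts = Counter()
    for vertex_labels, edges in database:
        seen = set()                   # count each triple once per graph
        for (u, v), elabel in edges.items():
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            seen.add((a, elabel, b))
        counts.update(seen)
    # For 1-edge codes (0, 1, a, e, b), sorting the label triples gives
    # the lexicographic order of the codes.
    return sorted(t for t, c in counts.items() if c >= min_sup)


# Toy database (made up for this example).
database = [
    ({1: "A", 2: "A", 3: "B"}, {(1, 2): "a", (2, 3): "b"}),
    ({1: "A", 2: "A"}, {(1, 2): "a"}),
    ({1: "B", 2: "A"}, {(1, 2): "b"}),
]
print(frequent_edges(database, min_sup=2))
```

With a minimum support of 2, both (A, a, A) and (A, b, B) survive in this toy database; the mining loop would then grow subgraphs from (A, a, A) first.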

Runtime: Synthetic
(Figure: runtime in seconds on the synthetic datasets.)

Runtime: Chemical
(Figure: runtime in seconds, with ticks at 1, 10, 100, and 1000, against support threshold from 0 to 30%, for Apriori (FSG) and gSpan.)
Dataset: 340 chemical compounds, 24 different atoms, 66 atom types, 4 bond types. The graphs are sparse: on average 27 vertices and 28 edges per graph; the largest has 214 vertices and 214 edges.

gSpan Advantages
Lower memory requirements: under 100 MB for the chemical dataset, while FSG ran out of memory for support below 5%.
Faster than naïve FSG by an order of magnitude: 6 to 30 times on the synthetic data and 15 to 100 times on the chemical data. FSG (Apriori) takes 10 minutes to process a dataset with 6.5% minimum support; gSpan completes in 10 seconds.
No candidate generation.
Lexicographic ordering minimizes the search tree.
False positives are pruned.
Any disadvantages?

FFSM: Fast Frequent Subgraph Mining, an Overview
How to solve the graph isomorphism problem? A novel graph canonical form: CAM.
How to tackle the subgraph isomorphism problem (NP-complete)? Incrementally maintained embeddings.
How to enumerate subgraphs? An efficient data structure, the CAM tree, with two operations: CAM-join and CAM-extension.
For comparison: FSG uses level-wise search; gSpan uses depth-first search.

Adjacency Matrix
Every diagonal entry of adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of that vertex. Every off-diagonal entry in the lower triangular part of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or with zero if there is no edge. For an undirected graph, the upper triangle is always a mirror of the lower triangle.
(Figure: graph P and three of its adjacency matrices, M1, M2, and M3.)

Code
The code of an n × n adjacency matrix M is the sequence of its lower triangular entries (including the diagonal entries) in the order M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n.
Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a
The Canonical Adjacency Matrix (CAM) is the one that produces the maximal code, using lexicographic order.
(Figure: matrices M1, M2, and M3 of graph P.)
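A brute-force sketch shows how the code and the canonical form relate: compute the code for every vertex ordering and keep the largest. The symbol order is an assumption inferred from the slide's example, where Code(M2) beats Code(M3) although it starts with a rather than b: alphabetically earlier letters rank higher and 0 ranks lowest. All function names are hypothetical, and trying all n! orderings is only feasible for tiny graphs.

```python
from itertools import permutations


def matrix_from_code(code):
    """Rebuild the full symmetric matrix from a lower-triangular code."""
    n = 0
    while n * (n + 1) // 2 < len(code):
        n += 1
    M = [["0"] * n for _ in range(n)]
    symbols = iter(code)
    for i in range(n):
        for j in range(i + 1):
            M[i][j] = M[j][i] = next(symbols)
    return M


def matrix_code(M):
    """Lower-triangular entries, row by row, diagonal included."""
    return "".join(M[i][j] for i in range(len(M)) for j in range(i + 1))


def rank(symbol):
    # Assumed symbol order: a > b > ... > y > 0.
    return float("-inf") if symbol == "0" else -ord(symbol)


def code_key(M):
    return [rank(s) for s in matrix_code(M)]


def canonical_adjacency_matrix(M):
    """Try every vertex ordering and keep the matrix with the maximal code."""
    n = len(M)
    candidates = ([[M[p[i]][p[j]] for j in range(n)] for i in range(n)]
                  for p in permutations(range(n)))
    return max(candidates, key=code_key)


M1 = matrix_from_code("aybyxb0y0c00y0d")
M2 = matrix_from_code("aybyxb00yd0y00c")
M3 = matrix_from_code("bxby0d0y0cyy00a")
cam_code = matrix_code(canonical_adjacency_matrix(M3))
print(cam_code)
```

Starting from M3, the search arrives at the code of M1, which agrees with the ordering of the three codes given on the slide.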

MP Submatrix
For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last non-zero edge entry from A. If the last row then has no edge entry left, that row is removed as well.
A CAM is connected iff the corresponding graph is connected.
Theorem I: A CAM’s MP submatrix is a CAM.
Theorem II: A connected CAM’s MP submatrix is connected.
(Figure: example matrices M1 to M6.)
(Speaker note: also explain the symmetric property.)
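The MP submatrix operation can be sketched on the lower-triangular rows of a matrix. The sketch reads "last non-zero entry" as the last non-zero edge (off-diagonal) entry of the last row, and drops the row once it holds no edge, following the speaker note on the slide; the helper names are hypothetical.

```python
def rows_from_code(code):
    """Split a lower-triangular code into rows of length 1, 2, 3, ..."""
    rows, start, width = [], 0, 1
    while start < len(code):
        rows.append(list(code[start:start + width]))
        start += width
        width += 1
    return rows


def code_of(rows):
    return "".join(s for row in rows for s in row)


def mp_submatrix(rows):
    """Remove the last edge entry; drop the last row if it has no edge left."""
    if len(rows) < 2:
        raise ValueError("a single-vertex matrix has no MP submatrix here")
    result = [row[:] for row in rows]          # leave the input untouched
    last = result[-1]
    for j in range(len(last) - 2, -1, -1):     # skip the diagonal entry
        if last[j] != "0":
            last[j] = "0"
            break
    if all(s == "0" for s in last[:-1]):
        result.pop()
    return result


# Walk down from the matrix with code aybyxb0y0c00y0d.
m = rows_from_code("aybyxb0y0c00y0d")
chain = []
while len(m) > 1:
    m = mp_submatrix(m)
    chain.append(code_of(m))
print(chain)
```

Each step removes exactly one edge, so the chain walks from the five-edge matrix down to a single vertex.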

CAM Tree: Subgraphs
(Figure: the CAM tree of the subgraphs of graph P.)
The first blinking object can be obtained by “superimposing” the two objects above it; the second one cannot. This is the motivation for suboptimal CAMs.

CAM Tree: Frequent Subgraphs
(Figure: the CAM tree of the frequent subgraphs of graphs P, Q, and S, with σ = 2/3.)

How to Enumerate Nodes in a CAM Tree?
Two operations to explore the CAM tree: CAM-join and CAM-extension.
Augmenting the CAM tree with suboptimal CAMs.
Objectives: no false dismissals and no redundancy. Plus: we want to do this efficiently!

Suboptimal Tree
We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.
(Figure: the suboptimal CAM tree of graph P. The largest matrix, which has five edges, is not shown due to space limitations.)
(Speaker note: may explain depth-first search and the use of information from siblings.)

Summary
Theorem: For a graph G, let CK-1 (CK) be the set of the suboptimal CAMs of all the size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of CK can be enumerated unambiguously either by joining two members of CK-1 or by extending a member of CK-1.

Experimental Study
Predictive Toxicology Evaluation Competition (PTE): contains 337 compounds; each graph contains 27 nodes and 27 edges on average.
NIH DTP Anti-Viral Screen Test (DTP CA/CM): chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI). We formed a dataset containing CA (423) and CM (1083); each graph contains 25 nodes and 27 edges on average.

Performance (PTE)
(Figure: two plots against support threshold (%).)

Performance (DTP CACM)
(Figure: two plots against support threshold (%).)