Mining Frequent Subgraphs

Mining Frequent Subgraphs COMP 790-90 Seminar Spring 2011

Overview Introduction: finding recurring subgraphs from graph databases. Algorithms covered: FSG, gSpan, FFSM. (Figure: left, a social network; right, a protein structure.)

Labeled Graph We define a labeled graph G as a five-element tuple G = (V, E, ΣV, ΣE, λ), where V is the set of vertices of G, E ⊆ V × V is a set of undirected edges, ΣV (ΣE) is the set of vertex (edge) labels, and λ is the labeling function λ: V → ΣV and E → ΣE that maps vertices and edges to their labels. (Figure: three example labeled graphs P, Q, and S.)
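One minimal way to represent such a labeled graph in Python (a sketch: the class name, fields, and the example edges for graph P are illustrative, since the slide only shows the graphs pictorially):

```python
class LabeledGraph:
    """An undirected graph with vertex and edge labels.

    vertex_labels maps vertex id -> label (the function lambda on V);
    edge_labels maps frozenset({u, v}) -> label (lambda on E).
    """
    def __init__(self):
        self.vertex_labels = {}
        self.edge_labels = {}

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # Undirected: store the edge as an unordered pair of endpoints.
        self.edge_labels[frozenset((u, v))] = label

# A graph in the spirit of P from the slides: vertices p1..p4 with
# labels a, b, c, d, joined by edges labeled x and y (the exact edge
# structure here is an assumption for illustration).
P = LabeledGraph()
for v, lab in [("p1", "a"), ("p2", "b"), ("p3", "c"), ("p4", "d")]:
    P.add_vertex(v, lab)
P.add_edge("p1", "p2", "x")
P.add_edge("p2", "p3", "y")
```

Storing edges as frozensets makes lookups order-independent, which is the natural choice for undirected graphs.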

Frequent Subgraph Mining Input: a set GD of labeled undirected graphs and a minimum support σ (here σ = 2/3). Output: all frequent subgraphs (w.r.t. σ) from GD. (Figure: database graphs P, Q, and S, and the frequent subgraphs mined from them.)

Finding Frequent Subgraphs Given a graph database GD = {G0, G1, …, Gn}, find all subgraphs appearing in at least σ graphs. Isomorphic subgraphs are considered the same subgraph. Apriori approaches: generation of subgraph candidates is complicated and expensive, and subgraph isomorphism is an NP-complete problem, so pruning is expensive.

Apriori-Based, Breadth-First Search Methodology: breadth-first search, joining two graphs. AGM (Inokuchi et al., PKDD'00) generates new graphs with one more node. FSG (Kuramochi and Karypis, ICDM'01) generates new graphs with one more edge.

FSG Algorithm
k = 1
F1 = all frequent edges
repeat
  k = k + 1
  Ck = join(Fk-1)
  Fk = frequent patterns in Ck
until Fk is empty
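The level-wise loop can be sketched in Python. This is a deliberately simplified model, an assumption for illustration: each graph is reduced to a set of labeled edges and a candidate pattern is just an edge set, so real subgraph-isomorphism-based support counting is replaced by plain set containment:

```python
from itertools import combinations

def fsg_levelwise(db, min_sup):
    """Level-wise (Apriori-style) mining, simplified.

    db: list of graphs, each reduced to a frozenset of labeled edges.
    A 'pattern' is an edge set; its support is the number of graphs
    that contain all of its edges (isomorphism testing is omitted).
    """
    def support(pattern):
        return sum(1 for g in db if pattern <= g)

    # F1: all frequent single edges.
    edges = {e for g in db for e in g}
    F = [{frozenset([e]) for e in edges
          if support(frozenset([e])) >= min_sup}]
    while F[-1]:
        # Join: merge two size-k patterns that share a (k-1)-edge core,
        # producing a size-(k+1) candidate.
        candidates = {p | q for p, q in combinations(F[-1], 2)
                      if len(p | q) == len(p) + 1}
        F.append({c for c in candidates if support(c) >= min_sup})
    return [level for level in F if level]

# Tiny database: three graphs as edge sets.
db = [frozenset({"xy", "yz", "zw"}),
      frozenset({"xy", "yz"}),
      frozenset({"xy"})]
levels = fsg_levelwise(db, 2)
```

The structure mirrors the slide: generate candidates from the previous frequent level, count support, stop when a level comes back empty.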

Join: Key Operation join(L) = ∪ join(P, Q) for all P, Q ∈ L. join(P, Q) = {G | P ⊆ G, Q ⊆ G, |G| = |P| + 1, |P| = |Q|}. Two graphs P and Q are joinable if joining them produces a non-empty set. Theorem: two graphs P and Q are joinable iff P ∩ Q is a graph of size |P| - 1, i.e., they share a common "core" of size |P| - 1.
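Treating size-k patterns as plain edge sets (a simplifying assumption that ignores isomorphism and the multiplicity cases on the following slides), the core-sharing condition can be sketched as:

```python
def joinable(P, Q):
    """P, Q: frozensets of edges representing size-k patterns.
    Simplified test: joinable iff they share a (k-1)-edge core."""
    return len(P) == len(Q) and len(P & Q) == len(P) - 1

def join(P, Q):
    """Union of two joinable patterns: one edge larger than P."""
    return P | Q if joinable(P, Q) else None

P = frozenset({"ab", "bc"})
Q = frozenset({"bc", "cd"})
```

Here P and Q share the single-edge core {"bc"}, so their join is the three-edge pattern; two patterns with no common core produce None.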

Multiplicity of Candidates Case 1: identical vertex labels. (Figure: a join producing several distinct candidates because vertex labels repeat.)

Multiplicity of Candidates Case 2: the core contains identical labels. Core: the (k-1)-subgraph that is common between the joined graphs. (Figure: a join producing several distinct candidates because the core's labels repeat.)

Multiplicity of Candidates Case 3: core multiplicity. The two joined graphs may share more than one common core. (Figure: a join with multiple possible cores.)

PATH Apriori-based approach whose building blocks are edge-disjoint paths.
Identify all frequent paths.
Construct frequent graphs with 2 edge-disjoint paths.
Construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths; repeat.
(Figure: a graph with 3 edge-disjoint paths.)

PATH Algorithm
k = 1
F1 = all frequent paths
repeat
  k = k + 1
  Ck = join(Fk-1)
  Fk = frequent patterns in Ck
until Fk is empty

Challenges Graph isomorphism: two graphs may have the same topology even though their layouts are different. Subgraph isomorphism: how to compute the support value of a pattern.

Graph Isomorphism Two graphs are isomorphic if they are topologically equivalent: they may be drawn differently but have the same structure.
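A brute-force sketch of the isomorphism test for small unlabeled graphs (practical miners avoid this by comparing canonical labels instead, as gSpan does below; trying every vertex bijection is exponential):

```python
from itertools import permutations

def are_isomorphic(adj1, adj2):
    """Brute-force graph isomorphism for small unlabeled graphs.

    adj1, adj2: dicts mapping vertex -> set of neighbour vertices.
    Tries every bijection between the vertex sets, so it is only
    practical for a handful of vertices.
    """
    v1, v2 = sorted(adj1), sorted(adj2)
    if len(v1) != len(v2):
        return False
    for perm in permutations(v2):
        m = dict(zip(v1, perm))  # candidate bijection v1 -> v2
        # The bijection works if it maps every neighbourhood exactly.
        if all({m[u] for u in adj1[v]} == adj2[m[v]] for v in adj1):
            return True
    return False

# A triangle drawn two different ways: layouts differ, topology same.
tri_a = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
tri_b = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
```

The two triangles are isomorphic despite using different vertex names, while the 3-vertex path is not isomorphic to either.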

Why Redundant Candidates? All of these algorithms may propose the same candidate several times, so we need to keep track of identical candidates to avoid redundancy in the results and redundant search.
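The bookkeeping can be sketched as follows, assuming each candidate is reduced to some canonical key. Here the key is simply the frozenset of the candidate's edges, a stand-in for a real canonical label such as gSpan's minimum DFS code:

```python
def dedupe(candidates):
    """Drop candidates that were proposed more than once.

    Each candidate is a list of edges; the frozenset of its edges
    serves as a stand-in canonical key, so two proposals with the
    same edges in a different order count as the same candidate.
    """
    seen, unique = set(), []
    for cand in candidates:
        key = frozenset(cand)
        if key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique

# Four proposals, but only two distinct candidates.
proposals = [["ab", "bc"], ["bc", "ab"], ["ab", "bc"], ["bc", "cd"]]
```

In a real miner the canonical key must also account for isomorphic re-labelings, which is exactly what the minimum DFS code on the next slides provides.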

gSpan DFS mining without candidate generation: gSpan relabels the graph representation to support DFS and discovers all frequent subgraphs without candidate generation or pruning. DFS representation: map each graph to a DFS code (a sequence of edges), order the codes lexicographically, and construct a search tree based on the lexicographic order.

Depth-First Search Tree (Figure: (a) a graph and (b)-(d) three of its depth-first search trees.) vi denotes the order in which vertices are visited. Dotted lines are back edges to already-visited nodes. The rightmost path is the path from v0 to vn.

DFS Codes A DFS code edge is a 5-tuple (start_index, end_index, start_label, edge_label, end_label). Given two edges e1 = (i1, j1) and e2 = (i2, j2), e1 < e2 if:
i1 = i2 and j1 < j2 (both edges start at the same node and e1 ends at an earlier-visited node), or
i1 < j1 and j1 = i2 (e1 is a forward edge that ends at the node where e2 starts).
code(G, T) = the sequence of edges of G ordered so that ei < ei+1, where T is a DFS tree of G.
DFS codes for the trees (b), (c), and (d):
edge  (b)            (c)            (d)
0     (0,1,X,a,Y)    (0,1,Y,a,X)    (0,1,X,a,X)
1     (1,2,Y,b,X)    (1,2,X,a,X)    (1,2,X,a,Y)
2     (2,0,X,a,X)    (2,0,X,b,Y)    (2,0,Y,b,X)
3     (2,3,X,c,Z)    (2,3,X,c,Z)    (2,3,Y,b,Z)
4     (3,1,Z,b,Y)    (3,0,Z,b,Y)    (3,0,Z,c,X)
5     (1,4,Y,d,Z)    (0,4,Y,d,Z)    (2,4,Y,d,Z)
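Because a DFS code is a sequence of 5-tuples, comparing two codes reduces to comparing tuple sequences. As a sketch, Python's built-in lexicographic ordering on lists of tuples reproduces the result for the three codes from the table; this is a simplification, since the full definition uses the special edge order <e, but these three codes already differ in their first tuple:

```python
# The three DFS codes of the same graph, one per DFS tree (b)-(d).
# Each tuple is (start_index, end_index, start_label, edge_label,
# end_label), as defined on the slide.
code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"),
          (2, 0, "X", "a", "X"), (2, 3, "X", "c", "Z"),
          (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
code_c = [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"),
          (2, 0, "X", "b", "Y"), (2, 3, "X", "c", "Z"),
          (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")]
code_d = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"),
          (2, 0, "Y", "b", "X"), (2, 3, "Y", "b", "Z"),
          (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")]

# Python compares lists of tuples element by element, which here
# matches DFS lexicographic order because the codes disagree at
# their very first tuples.
minimum = min([code_b, code_c, code_d])
```

Under this comparison the code for tree (d) comes first, illustrating how one DFS code per graph can be singled out as a canonical representative.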

DFS Lexicographic Order Let α = code(Gα, Tα) = (a0, a1, …, am) and β = code(Gβ, Tβ) = (b0, b1, …, bn). Then α ≤ β iff either:
(1) the two codes agree on a prefix and at <e bt at the first differing position t, or
(2) β is longer than α and α is a prefix of β (their common prefix elements are all equal).
Minimum DFS code: the minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G. Graphs A and B are isomorphic if and only if min(A) = min(B).

DFS Codes: Parents and Children If α = (a0, a1, …, am) and β = (a0, a1, …, am, b), then β is the child of α and α is the parent of β. A valid DFS code requires that b grows from a vertex on the rightmost path (the path from v0 to vn in the DFS tree).

DFS Code Trees Organize DFS codes as nodes in a tree by the parent-child relation. A pre-order traversal of this tree follows DFS lexicographic order. If s and s' encode the same graph with different DFS codes, then s' is not the minimum code and can be pruned.

gSpan D is the set of all graphs; S is the result set.

Algorithm 1: GraphSet_Projection(D, S)
1: sort labels in D by frequency
2: remove infrequent vertices and edges
3: relabel the remaining vertices and edges
4: S' = all frequent 1-edge graphs in D
5: sort S' in DFS lexicographic order
6: S = S'
7: foreach edge e in S' do
8:   s = graph defined by e
9:   s.D = graphs in D containing e
10:  Subgraph_Mining(D, S, s)
11:  D = D - e
12:  if |D| < minSup then break

Subprocedure 1: Subgraph_Mining(D, S, s)
1: if s != min(s) then return
2: S = S ∪ {s}
3: s' = the 1-edge-larger children of s found in s.D
4: foreach child c in s' do
5:   if support(c) ≥ minSup then Subgraph_Mining(s.D, S, c)

Vertices = {A, B, C, …}; edges = {a, b, c, …}. Subgraph_Mining grows all nodes in the subtree rooted at s. Each iteration of the foreach loop finds all frequent subgraphs containing (A,a,A), then all those containing (B,b,C) but not (A,a,A), and so on.

Runtime: Synthetic (Figure: runtime in seconds on the synthetic datasets.)

Runtime: Chemical (Figure: runtime in seconds versus support threshold (%) for Apriori (FSG) and gSpan.) Dataset: 340 chemical compounds, 24 different atoms, 66 atom types, 4 bond types. Sparse: on average 27 vertices and 28 edges per graph; the largest graph has 214 vertices and 214 edges.

gSpan Advantages Lower memory requirements: under 100 MB on the chemical dataset, where FSG (Apriori) ran out of memory for support below 5%. Faster than FSG by an order of magnitude: 6-30x on synthetic data and 15-100x on chemical data; for example, at 6.5% minimum support FSG takes 10 minutes while gSpan completes in 10 seconds. No candidate generation. Lexicographic ordering minimizes the search tree. Prunes false positives. Any disadvantage?