1 Frequent Subgraph Mining
Jianlin Feng, School of Software, SUN YAT-SEN UNIVERSITY
June 12, 2010


2 Modeling Data With Graphs: Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements.

Data                                 Graph
Data Instance                        Graph Instance
Element                              Vertex
Element’s Attributes                 Vertex Label
Relation Between Two Elements        Edge
Type of Relation                     Edge Label
Relation Between a Set of Elements   Hyper Edge

Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.

3 Graph, Graph, Everywhere
[Figure panels: Aspirin; yeast protein interaction network, from H. Jeong et al., Nature 411, 41 (2001); Internet; co-author network]

4 Frequent Subgraph Discovery (proposed in ICDM 2001)
Given
- D: a set of undirected, labeled graphs
- σ: support threshold, 0 < σ ≤ 1
Find all connected, undirected graphs that are subgraphs of at least σ·|D| of the input graphs
- Deciding this requires subgraph isomorphism tests

5 Example: Frequent Subgraphs
[Figure: graph dataset with graphs (A), (B), (C); frequent patterns (1), (2) at minimum support 2]

6 Example (II)
[Figure: graph dataset and its frequent patterns at minimum support 2]

7 Terminology-I
A graph G(V, E) is made of two sets:
- V: set of vertices
- E: set of edges
Assume undirected, labeled graphs:
- L_V: set of vertex labels
- L_E: set of edge labels

8 Terminology-II
A graph is said to be connected if there is a path between every pair of vertices.
A graph G_s(V_s, E_s) is a subgraph of another graph G(V, E) iff
- V_s is a subset of V and E_s is a subset of E
Two graphs G_1(V_1, E_1) and G_2(V_2, E_2) are isomorphic if they are topologically identical:
- There is a one-to-one mapping (bijection) from V_1 to V_2 such that each edge in E_1 is mapped to a single edge in E_2 and vice versa; for labeled graphs, the mapping must also preserve vertex and edge labels
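The isomorphism definition above can be checked directly on very small graphs by trying every bijection between the vertex sets. The following is a minimal sketch, not part of the original slides: the function name `are_isomorphic` and the graph representation (a dict of vertex labels plus a dict from `frozenset({u, v})` to edge label) are assumptions made for illustration, and the graphs are assumed to be simple (no self-loops).

```python
from itertools import permutations


def are_isomorphic(vlabels1, edges1, vlabels2, edges2):
    """Brute-force isomorphism test for small undirected labeled graphs.

    vlabels*: dict mapping vertex -> vertex label
    edges*:   dict mapping frozenset({u, v}) -> edge label
    (Hypothetical representation chosen for this sketch.)
    """
    if len(vlabels1) != len(vlabels2) or len(edges1) != len(edges2):
        return False
    v1 = list(vlabels1)
    # Try every bijection from the vertices of graph 1 to those of graph 2.
    for perm in permutations(vlabels2):
        mapping = dict(zip(v1, perm))
        # Vertex labels must be preserved.
        if any(vlabels1[u] != vlabels2[mapping[u]] for u in v1):
            continue
        # Every edge of graph 1 must land on an edge of graph 2 with the
        # same label; equal edge counts make the edge mapping one-to-one.
        ok = True
        for e, lbl in edges1.items():
            image = frozenset(mapping[u] for u in e)
            if image not in edges2 or edges2[image] != lbl:
                ok = False
                break
        if ok:
            return True
    return False
```

The loop visits up to |V|! bijections, so this is only meant to make the definition concrete for graphs with a handful of vertices.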

9 Example of Graph Isomorphism

10 Terminology-III: Subgraph isomorphism problem
Given two graphs G_1(V_1, E_1) and G_2(V_2, E_2), find an isomorphism between G_2 and a subgraph of G_1:
- There is a one-to-one mapping from V_2 into V_1 such that each edge in E_2 is mapped to a distinct edge in E_1 with matching labels
NP-complete problem
- Reduction from the max-clique or Hamiltonian cycle problem
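A brute-force version of the subgraph isomorphism test makes the definition concrete: try every injective assignment of pattern vertices to target vertices and check that each pattern edge is present in the target. This is a sketch under assumptions, not the method used by FSG; the name `contains_subgraph` and the dict-based graph representation are hypothetical, and graphs are assumed to be simple.

```python
from itertools import permutations


def contains_subgraph(target_vlabels, target_edges, pat_vlabels, pat_edges):
    """Return True if the pattern is isomorphic to a subgraph of the target.

    *_vlabels: dict mapping vertex -> vertex label
    *_edges:   dict mapping frozenset({u, v}) -> edge label
    (Hypothetical representation chosen for this sketch.)
    """
    pat_vertices = list(pat_vlabels)
    # permutations(..., r) enumerates every injective choice of r target
    # vertices, in every order, i.e. every one-to-one mapping.
    for image in permutations(target_vlabels, len(pat_vertices)):
        mapping = dict(zip(pat_vertices, image))
        if any(pat_vlabels[u] != target_vlabels[mapping[u]] for u in pat_vertices):
            continue
        # Each pattern edge must exist in the target with the same label.
        # Extra target edges are allowed (the subgraph need not be induced).
        if all(
            target_edges.get(frozenset(mapping[u] for u in e), object()) == lbl
            for e, lbl in pat_edges.items()
        ):
            return True
    return False
```

The running time grows factorially with the pattern size, which is consistent with the problem being NP-complete; practical miners rely on pruning to limit how often such a test is run.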

11 FSG: Frequent Subgraph Discovery Algorithm
Follows an Apriori-style level-by-level approach and grows the patterns one edge at a time.

12 FSG: Frequent Subgraph Discovery Algorithm
Key elements for FSG’s computational scalability:
- Improved candidate generation scheme
- Use of TID-list approach for frequency counting
- Efficient canonical labeling algorithm

13 FSG: Basic Flow of the Algorithm
Enumerate all single- and double-edge subgraphs
Repeat
- Generate all candidate subgraphs of size (k+1) from size-k subgraphs
- Count the frequency of each candidate
- Prune subgraphs that don’t satisfy the support constraint
Until (no frequent subgraphs at size (k+1))
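The repeat/until loop above has the same shape as any Apriori-style level-wise search, which can be sketched independently of how patterns are represented. In the sketch below, `levelwise_mine`, `generate_candidates`, and `support` are hypothetical names, and the callbacks stand in for FSG's candidate generation and frequency counting, which are not implemented here. The small demo uses itemsets rather than graphs purely to exercise the control flow.

```python
from itertools import combinations


def levelwise_mine(seed_patterns, generate_candidates, support, min_support):
    """Generic level-wise loop: generate, count, prune, repeat.

    seed_patterns:       smallest patterns to start from
    generate_candidates: callable(list of frequent size-k patterns)
                         -> iterable of size-(k+1) candidates
    support:             callable(pattern) -> number of supporting transactions
    (All three are placeholders supplied by the caller.)
    """
    frequent = [p for p in seed_patterns if support(p) >= min_support]
    all_frequent = list(frequent)
    while frequent:  # stop when a level yields no frequent patterns
        candidates = generate_candidates(frequent)
        frequent = [c for c in candidates if support(c) >= min_support]
        all_frequent.extend(frequent)
    return all_frequent


# Toy stand-in: itemsets instead of graphs, to show the loop running.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]


def itemset_support(pattern):
    return sum(1 for t in transactions if pattern <= t)


def itemset_join(frequent):
    # Join two size-k itemsets that differ in exactly one item.
    return {p | q for p, q in combinations(frequent, 2) if len(p | q) == len(p) + 1}


seeds = [frozenset({x}) for x in "abc"]
result = levelwise_mine(seeds, itemset_join, itemset_support, min_support=2)
```

For graphs, the join and the support function are where the real work lies: the join has to handle the cases described on the candidate generation slides, and the support function needs subgraph isomorphism.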

14 FSG: Candidate Generation - I
Join two frequent size-k subgraphs to get a size-(k+1) candidate
- A common connected subgraph of size (k-1) is necessary
Problem
- A size-k graph can have up to k different size-(k-1) subgraphs
- If we consider all possible subgraphs, we will end up
  - Generating the same candidates multiple times
  - Generating candidates that are not downward closed
  - Suffering a significant slowdown
- Apriori doesn’t suffer from this problem due to the lexicographic ordering of itemsets

15 FSG: Candidate Generation - II
Joining two size-k subgraphs may produce multiple distinct size-(k+1) candidates
- CASE 1: The difference can be a vertex with the same label

16 FSG: Candidate Generation - III
- CASE 2: The primary subgraph itself may have multiple automorphisms
- CASE 3: In addition to joining two different k-graphs, FSG also needs to perform self-joins

17 FSG: Candidate Generation Scheme
For each frequent size-k subgraph F_i, define its primary subgraphs: P(F_i) = {H_{i,1}, H_{i,2}}
- H_{i,1}, H_{i,2}: the two size-(k-1) subgraphs of F_i with the smallest and second-smallest canonical labels
FSG will join two frequent subgraphs F_i and F_j iff P(F_i) ∩ P(F_j) ≠ ∅
This approach (TKDE 2004) correctly generates all valid candidates and leads to a significant performance improvement over the ICDM 2001 paper
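The join condition on this slide reduces to a set intersection once the canonical labels of each pattern's size-(k-1) subgraphs are known. The sketch below assumes those labels have already been computed and are given as strings; the function names `primary_subgraphs` and `should_join` and the example labels are hypothetical.

```python
def primary_subgraphs(sub_labels):
    """Return P(F): the smallest and second-smallest distinct canonical
    labels among the size-(k-1) subgraphs of a pattern.

    sub_labels: iterable of canonical labels (strings), assumed precomputed.
    """
    return set(sorted(set(sub_labels))[:2])


def should_join(sub_labels_i, sub_labels_j):
    """Join F_i and F_j only if their primary subgraph sets intersect."""
    return bool(primary_subgraphs(sub_labels_i) & primary_subgraphs(sub_labels_j))


# Made-up canonical labels for the (k-1)-subgraphs of three patterns.
f1 = ["aab", "abb", "abc"]
f2 = ["abb", "abc", "bcc"]
f3 = ["abc", "bcc", "ccc"]
```

With these labels, `f1` and `f2` share the primary subgraph `"abb"` and would be joined, while `f1` and `f3` would not. A pattern always shares its primary subgraphs with itself, which matches the self-join case.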

18 FSG: Frequency Counting
Naïve way
- Subgraph isomorphism check for each candidate against each graph transaction in the database
- Computationally expensive and prohibitive for large datasets
FSG uses transaction identifier (TID) lists
- For each frequent subgraph, keep a list of the TIDs that support it
To compute the frequency of G_{k+1}
- Intersect the TID lists of its subgraphs
- If the size of the intersection < min_support, prune G_{k+1}
- Else, run the subgraph isomorphism check only for graphs in the intersection
Advantages
- FSG is able to prune candidates without subgraph isomorphism
- For large datasets, only those graphs that may potentially contain the candidate are checked
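The TID-list steps above can be sketched as follows. This is an illustrative outline under assumptions: `candidate_tids` is a hypothetical name, `min_support` is taken to be an absolute count, and the `contains` callback stands in for a real subgraph isomorphism test. The demo replaces graphs with sets of edge names and uses a subset test only to show the pruning behaviour.

```python
def candidate_tids(candidate, sub_tid_lists, database, contains, min_support):
    """Return the set of TIDs supporting `candidate`, or None if it is pruned.

    sub_tid_lists: list of TID sets, one per frequent size-k subgraph of the candidate
    database:      dict mapping TID -> graph transaction
    contains:      callable(graph, candidate) -> bool (placeholder for
                   a subgraph isomorphism test)
    """
    # A transaction can contain the candidate only if it supports
    # every one of the candidate's frequent subgraphs.
    possible = set.intersection(*sub_tid_lists)
    if len(possible) < min_support:
        return None  # pruned without running any containment test
    tids = {tid for tid in possible if contains(database[tid], candidate)}
    return tids if len(tids) >= min_support else None


# Toy stand-in: each "graph" is a set of edge names, containment is a subset test.
database = {0: {"ab", "bc", "cd"}, 1: {"ab", "bc"}, 2: {"ab", "cd"}, 3: {"bc"}}
calls = []


def toy_contains(graph, candidate):
    calls.append(1)  # count how many containment tests are actually run
    return candidate <= graph


tid_ab, tid_bc = {0, 1, 2}, {0, 1, 3}
kept = candidate_tids({"ab", "bc"}, [tid_ab, tid_bc], database, toy_contains, 2)
```

In this toy run only transactions 0 and 1 are tested, because transactions 2 and 3 are missing from at least one TID list.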

19 Canonical label of graph
Lexicographically largest (or smallest) string obtained by concatenating the upper triangular entries of the adjacency matrix (after symmetric permutation)
Uniquely identifies a graph and its isomorphs
- Two isomorphic graphs will get the same canonical label

20 Use of canonical label
FSG uses canonical labeling to
- Eliminate duplicate candidates
- Check if a particular pattern satisfies monotonicity
Naïve approach for finding the canonical label is O(|V|!)
- Impractical even for moderate-size graphs
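The naïve O(|V|!) approach can be written down directly: try every vertex ordering, build the string for that ordering, and keep the lexicographically smallest. The sketch below is an illustration and not FSG's actual encoding. It assumes single-character vertex and edge labels, uses `"0"` for a missing edge, and prefixes the vertex labels to the upper-triangular entries so that labeled graphs are distinguished; the name `canonical_label` is hypothetical.

```python
from itertools import permutations


def canonical_label(vlabels, edges):
    """Brute-force canonical label of a small undirected labeled graph.

    vlabels: dict mapping vertex -> single-character vertex label
    edges:   dict mapping frozenset({u, v}) -> single-character edge label
             (must not be "0", which marks a missing edge in this sketch)
    """
    best = None
    for order in permutations(vlabels):
        n = len(order)
        # Vertex labels in this ordering, then the upper triangle of the
        # adjacency matrix read row by row.
        labels = "".join(vlabels[v] for v in order)
        upper = "".join(
            edges.get(frozenset((order[i], order[j])), "0")
            for i in range(n)
            for j in range(i + 1, n)
        )
        code = labels + "|" + upper
        if best is None or code < best:
            best = code
    return best
```

Two isomorphic graphs produce the same set of codes over all orderings, so they share the same minimum; this is what lets duplicate candidates be detected by comparing strings.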

21 FSG: canonical labeling
Vertex invariants
- Inherent properties of vertices that don’t change across isomorphic mappings
- E.g. degree or label of a vertex
Use vertex invariants to partition the vertices of a graph into equivalence classes
If vertex invariants cause m partitions of V containing p_1, p_2, …, p_m vertices respectively, then the number of different permutations for canonical labeling is
  ∏_{i=1}^{m} (p_i!)
which can be significantly smaller than the |V|! permutations

22 FSG canonical label: vertex invariant
Partition based on vertex degrees and labels
Example: number of permutations = 1! × 2! × 1! = 2, instead of 4! = 24
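The saving from vertex invariants can be computed for any small graph by grouping vertices on (label, degree) and multiplying the factorials of the group sizes. The sketch below is illustrative: `permutation_counts` is a hypothetical name, and the example graph (a triangle with one pendant vertex, all vertices labeled `"C"`) is made up to reproduce partition sizes 1, 2, 1; it is not necessarily the graph drawn on the slide.

```python
from collections import Counter
from math import factorial, prod


def permutation_counts(vlabels, edges):
    """Return (restricted, unrestricted) permutation counts.

    vlabels: dict mapping vertex -> vertex label
    edges:   iterable of frozenset({u, v}) edges
    """
    degree = Counter()
    for e in edges:
        for v in e:
            degree[v] += 1
    # Vertices with the same (label, degree) invariant share a partition.
    cells = Counter((vlabels[v], degree[v]) for v in vlabels)
    restricted = prod(factorial(size) for size in cells.values())
    return restricted, factorial(len(vlabels))


# Hypothetical example: triangle b-c-d with pendant vertex a attached to b.
vlabels = {"a": "C", "b": "C", "c": "C", "d": "C"}
edges = [frozenset("ab"), frozenset("bc"), frozenset("bd"), frozenset("cd")]
counts = permutation_counts(vlabels, edges)
```

Here the degrees are 1, 3, 2, 2, giving partitions of sizes 1, 1, and 2, so only 2 of the 24 vertex orderings need to be examined.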

23 Next steps
What are possible applications that you can think of?
- Chemistry
- Biology
We have only looked at “frequent subgraphs”
- What are other measures for similarity between two graphs?
- What graph properties do you think would be useful?
- Can we do better if we impose restrictions on subgraphs?
  - Frequent sub-trees
  - Frequent sequences
  - Frequent approximate sequences

24 References
- Jiawei Han. Graph Mining: Part I, Graph Pattern Mining.
- George Karypis. Mining Scientific Data Sets Using Graphs.
- Sangameshwar Patil. Introduction to Graph Mining.