University at Buffalo, The State University of New York
Lei Shi
Department of Computer Science and Engineering, State University of New York at Buffalo
Frequent Subgraph/Substructure Mining
Seminar 2009
Outline
- Introduction
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
Graphs are everywhere
Graph Mining Problems
- Graph Pattern Mining
  - Frequent subgraph pattern mining
  - Pattern summarization
  - Optimal graph patterns
  - Graph patterns with constraints
  - Approximate graph patterns
  - ...
- Graph Classification
- Graph clustering
- Important node identification
- Bridge and hub identification
- Other Important Topics
  - Graph compression
  - Graph model
  - Social network analysis
Subgraph Pattern Mining
Frequent subgraph
- A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.
Applications of subgraph pattern mining
- Mining biochemical structures
- Program control flow analysis
- Mining XML structures or Web communities
- Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Frequent Subgraph Example
[Figure: three labeled graphs (1), (2), (3) with vertex labels A, B, C, followed by candidate subgraphs and a row of their support values, extracted as "331".]
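The support definition can be tried out on a toy dataset. The following is a brute-force sketch for illustration only, not how FSG or gSpan count support. The graph encoding (a dict of vertex labels plus an edge list), the names `contains` and `support`, and the three toy graphs are assumptions of this example; they are not the graphs from the figure.

```python
from itertools import permutations

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a labeled (non-induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_adj = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all(frozenset((m[u], m[v])) in g_adj for u, v in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in `dataset` that contain `pattern`."""
    return sum(contains(g, pattern) for g in dataset)

# Hypothetical toy dataset: (vertex labels, undirected edges).
dataset = [
    ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),  # A - B - C
    ({0: "A", 1: "B", 2: "A"}, [(0, 1), (1, 2)]),  # A - B - A
    ({0: "A", 1: "C"}, [(0, 1)]),                  # A - C
]
a_b = ({0: "A", 1: "B"}, [(0, 1)])
b_c = ({0: "B", 1: "C"}, [(0, 1)])

min_sup = 2
print(support(dataset, a_b) >= min_sup)  # True: A-B occurs in 2 graphs
print(support(dataset, b_c) >= min_sup)  # False: B-C occurs in 1 graph
```

Trying all vertex mappings is exponential in the pattern size, which is one reason support counting is listed among the costly steps of subgraph mining.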
Key Challenges in Subgraph Mining
- Graph isomorphism: detecting whether two graphs are identical in structure.
- Graph representation (canonical labeling): a canonical label is a unique code for a given graph. It should be the same no matter how a graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.
- Subgraph candidate generation: generating candidate frequent subgraphs from the dataset.
Subgraph Mining Approaches
Apriori-based
- AGM/AcGM: Inokuchi et al. (PKDD’00)
- FSG: Kuramochi and Karypis (ICDM’01)
  M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov.
- PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
- FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
- FTOSM: Horvath et al. (KDD’06)
Pattern growth based
- Subdue: Holder et al. (KDD’94)
- MoFa: Borgelt and Berthold (ICDM’02)
- gSpan: Yan and Han (ICDM’02)
  Yan, X. and Han, J. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721
- Gaston: Nijssen and Kok (KDD’04)
- CMTreeMiner: Chi et al. (TKDE’05)
- LEAP: Yan et al. (SIGMOD’08)
Outline
- Introduction and Background
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
Apriori-based Approach
FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, Nov.
- Flattened representation as canonical labeling
- Apriori-based method to generate subgraph candidates
Graph Representation in FSG: Flattened Representation
Graph Representation in FSG: Flattened Representation
- Lexicographic (dictionary) order
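A flattened representation combined with lexicographic order can be sketched as follows: write the vertex labels in some vertex order, append the upper triangle of the adjacency matrix column by column, and keep the lexicographically smallest string over all vertex orders. This is a minimal brute-force sketch under stated assumptions, not FSG's actual procedure: the names `flatten` and `canonical_label`, the `"0"` marker for a missing edge, the single-character labels, and the choice of the minimum (rather than the maximum) string are all choices made for this example.

```python
from itertools import permutations

def flatten(labels, edges, order):
    """Vertex labels in `order`, then upper-triangle adjacency entries, column by column."""
    pos = {v: i for i, v in enumerate(order)}
    n = len(order)
    cell = {}
    for u, v, lab in edges:
        i, j = sorted((pos[u], pos[v]))
        cell[(i, j)] = lab
    vertex_part = "".join(labels[v] for v in order)
    edge_part = "".join(cell.get((i, j), "0") for j in range(1, n) for i in range(j))
    return vertex_part + "|" + edge_part

def canonical_label(labels, edges):
    """Smallest flattened string over all vertex orderings (n! of them)."""
    return min(flatten(labels, edges, order) for order in permutations(labels))

# Two encodings of the same path A -x- B -y- C with different vertex ids.
g1 = ({0: "A", 1: "B", 2: "C"}, [(0, 1, "x"), (1, 2, "y")])
g2 = ({5: "C", 7: "A", 9: "B"}, [(9, 7, "x"), (5, 9, "y")])
# A different graph: A -x- B and A -y- C.
g3 = ({0: "A", 1: "B", 2: "C"}, [(0, 1, "x"), (0, 2, "y")])

print(canonical_label(*g1))                          # ABC|x0y
print(canonical_label(*g1) == canonical_label(*g2))  # True
print(canonical_label(*g1) == canonical_label(*g3))  # False
```

Because the label is the same for every encoding of a graph, two graphs can be compared by comparing strings. Enumerating all orderings is only feasible for very small graphs.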
Apriori-based method
Apriori property
- If a graph is frequent, all of its subgraphs are frequent.
Candidate generation
- Create a set of candidates of size k+1
  - from two given frequent k-subgraphs
  - that contain the same (k-1)-subgraph
  - which can result in several candidates of size k+1
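The join-and-prune idea can be illustrated in a deliberately simplified setting. The sketch below assumes that vertices have fixed, globally unique identities, so a pattern is just a set of edges and no isomorphism test is needed. That assumption is what FSG cannot make: with only labels to go on, the shared (k-1)-subgraph can be aligned in more than one way, which is why one pair may yield several candidates. The names `is_connected` and `generate_candidates` are hypothetical.

```python
from itertools import combinations

def is_connected(edges):
    """True if the (non-empty) edge set forms one connected component."""
    edges = list(edges)
    if not edges:
        return False
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in seen) != (v in seen):  # edge reaches a new vertex
                seen.update((u, v))
                changed = True
    return all(u in seen and v in seen for u, v in edges)

def generate_candidates(frequent_k):
    """Join k-edge patterns sharing k-1 edges, then prune with the Apriori property."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        union = a | b
        if len(union) != len(a) + 1 or not is_connected(union):
            continue
        # Apriori pruning: every connected k-edge sub-pattern must itself be frequent.
        subs = [union - {e} for e in union]
        if all(s in frequent_k for s in subs if is_connected(s)):
            candidates.add(union)
    return candidates

ab, bc, cd = ("a", "b"), ("b", "c"), ("c", "d")
frequent_1 = {frozenset({ab}), frozenset({bc}), frozenset({cd})}
candidates_2 = generate_candidates(frequent_1)
print(len(candidates_2))  # 2: {ab, bc} and {bc, cd}; {ab, cd} is not connected
```

Each surviving candidate would still need its support counted against the dataset before it is accepted as frequent.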
Apriori-based method: Candidate generation example
Apriori-based method: Flowchart
Apriori-based method: Experimental result
- Chemical compound dataset containing 340 compounds and 24 different atoms (vertices)
Outline
- Introduction
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
Motivation of gSpan
Weaknesses of the Apriori-based approach
- Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
- Pruning false positives requires subgraph isomorphism tests; subgraph isomorphism is an NP-complete problem, so this is expensive.
gSpan: Graph-Based Substructure Pattern Mining
- Changes the way a graph is represented (DFS: Depth First Search)
- Uses pattern growth to generate new subgraph candidates
gSpan: Graph-Based Substructure Pattern Mining
DFS (Depth First Search) code
- First step: traverse the graph depth-first and use the edges on the path to represent the graph.
- Second step: DFS lexicographic order
Pattern growth subgraph generation
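DFS code
- An edge is represented by a 5-tuple (i, j, l_i, l_(i,j), l_j): the DFS discovery indices of its two endpoints, the label of the first endpoint, the edge label, and the label of the second endpoint.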
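To make the 5-tuple representation concrete, here is a small sketch that produces one DFS code for a labeled graph. It assumes a particular start vertex and visits neighbours in sorted order, so it yields just one of the many DFS codes a graph can have, not the minimum one that gSpan uses as the canonical form. The function name `dfs_code`, the graph encoding, and the example graph are assumptions of this sketch.

```python
def dfs_code(labels, edges, start):
    """One DFS code: a list of (i, j, l_i, l_ij, l_j) tuples in traversal order."""
    adj = {v: {} for v in labels}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab
    index = {start: 0}  # vertex -> DFS discovery index
    code = []

    def visit(v, parent):
        # Backward edges: from v to earlier-discovered vertices other than its parent.
        back = sorted((index[w], w) for w in adj[v] if w in index and w != parent)
        for iw, w in back:
            code.append((index[v], iw, labels[v], adj[v][w], labels[w]))
        # Forward edges: discover new vertices; neighbour order picks the DFS tree.
        for w in sorted(adj[v]):
            if w not in index:
                index[w] = len(index)
                code.append((index[v], index[w], labels[v], adj[v][w], labels[w]))
                visit(w, v)

    visit(start, None)
    return code

# Hypothetical graph: a triangle a-b-c with an extra vertex d attached to b.
labels = {"a": "X", "b": "Y", "c": "X", "d": "Z"}
edges = [("a", "b", "p"), ("b", "c", "q"), ("c", "a", "p"), ("b", "d", "r")]
code = dfs_code(labels, edges, "a")
for edge in code:
    print(edge)
# (0, 1, 'X', 'p', 'Y')
# (1, 2, 'Y', 'q', 'X')
# (2, 0, 'X', 'p', 'X')   backward edge closing the triangle
# (1, 3, 'Y', 'r', 'Z')
```

A tuple with i < j is a forward edge that discovers vertex j; a tuple with i > j is a backward edge to an already discovered vertex.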
DFS code
- Second step: DFS lexicographic order
Pattern Growth Approach: Pattern growth (free extension)
Pattern Growth Approach: Duplicate graphs
Pattern Growth Approach: Free extension
Pattern Growth Approach: Rightmost extension
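Rightmost extension restricts where a pattern may grow: backward edges may only start at the rightmost vertex (the last one discovered) and end on the rightmost path, and forward edges may only start from vertices on the rightmost path. The sketch below computes those positions from a DFS code given as 5-tuples. The function names and the example code are assumptions of this sketch, and it only lists index pairs; choosing labels and checking frequency are left out.

```python
def rightmost_path(code):
    """DFS indices from the root to the rightmost vertex, following forward edges."""
    parent = {j: i for i, j, *_ in code if i < j}  # forward edges have i < j
    v = max(parent, default=0)                     # rightmost vertex
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]

def extension_slots(code):
    """Index pairs (i, j) where rightmost extension may add an edge."""
    path = rightmost_path(code)
    rightmost = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in code}
    # Backward: rightmost vertex to a vertex on the rightmost path, if not already joined.
    backward = [(rightmost, v) for v in path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward: any rightmost-path vertex to a brand-new vertex.
    new_vertex = rightmost + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

# Hypothetical DFS code: triangle 0-1-2 plus vertex 3 attached to vertex 1.
code = [
    (0, 1, "X", "p", "Y"),
    (1, 2, "Y", "q", "X"),
    (2, 0, "X", "p", "X"),
    (1, 3, "Y", "r", "Z"),
]
print(rightmost_path(code))   # [0, 1, 3]
print(extension_slots(code))  # ([(3, 0)], [(3, 4), (1, 4), (0, 4)])
```

Vertex 2 is not on the rightmost path, so no edge is grown from it; this restriction is what cuts down the duplicate graphs that free extension would generate.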
Pattern Growth Approach: Examples (cont.)
gSpan
Pattern Growth Approach: Experimental result using chemical data
- 340 molecules
- 66 atom types and 4 bond types as labels
- On average, a graph has only 27 vertices and 28 edges
Summary
- Graph representation: flattened representation vs. DFS code
- Generation of candidate patterns: apriori vs. pattern growth
Pattern-Growth Approach
Frequent Graph Pattern
Given a graph dataset D, find subgraph g such that freq(g) >= θ, where freq(g) is the percentage of graphs in D that contain g and θ is the minimum support threshold.
- Problem 1: Exponential pattern set
- Problem 2: Threshold setting
Difference between frequent itemset and frequent subgraph discovery
Frequent itemset discovery
Subgraph Mining Algorithms
Apriori-based approach
- AGM/AcGM: Inokuchi et al. (PKDD’00)
- FSG: Kuramochi and Karypis (ICDM’01)
- PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
- FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
- FTOSM: Horvath et al. (KDD’06)
Pattern growth approach
- Subdue: Holder et al. (KDD’94)
- MoFa: Borgelt and Berthold (ICDM’02)
- gSpan: Yan and Han (ICDM’02)
- Gaston: Nijssen and Kok (KDD’04)
- CMTreeMiner: Chi et al. (TKDE’05)
- LEAP: Yan et al. (SIGMOD’08)
Framework of Subgraph Mining Algorithms
- Search order: breadth vs. depth; complete vs. incomplete
- Generation of candidate patterns: apriori vs. pattern growth
- Discovery order of patterns: DFS order; path -> tree -> graph
- Elimination of duplicate subgraphs: passive vs. active
- Support calculation: store embeddings or not
Frequent Subgraph Examples
Example (cont.)
Outline
- Introduction and Background
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
DFS code
Yan, X. and Han, J. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721
Pattern Growth Approach