gSpan: Graph-Based Substructure Pattern Mining


gSpan: Graph-Based Substructure Pattern Mining. Presented by: Sadik Mussah

Outline: Background; Problem Definition; Authors' Contribution; Concepts Behind gSpan; Experimental Results; Conclusion

Background: Frequent subgraph mining is an extension of existing frequent pattern mining algorithms. A major challenge is counting how many instances of a pattern occur in the dataset. Counting instances is easy for sets but subtle for graphs: recall the graph isomorphism problem.

Background: [Figure: two isomorphic graphs (a) and (b) with their mapping function (c)] G1 = (V1, E1, L1), G2 = (V2, E2, L2). Mapping: f(V1.1) = V2.2, f(V1.2) = V2.5, f(V1.3) = V2.3, f(V1.4) = V2.4, f(V1.5) = V2.1. Two graphs are isomorphic if one can find a one-to-one mapping from the nodes of the first graph onto the nodes of the second graph such that edges are preserved and the labels on nodes and edges are preserved.
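The definition above can be checked mechanically for a given candidate mapping. The following is a minimal Python sketch, not code from the presentation: the graph representation (a dict of vertex labels plus a dict of edge labels keyed by `frozenset({u, v})`), the function name `is_isomorphism`, and the two toy graphs are all assumptions made for illustration.

```python
def is_isomorphism(f, g1, g2):
    """Return True if mapping f is a label-preserving isomorphism from g1 to g2.

    Each graph is a pair (vertex_labels, edge_labels), where vertex_labels maps
    vertex -> label and edge_labels maps frozenset({u, v}) -> label.
    This representation is an assumption made for this sketch.
    """
    v1, e1 = g1
    v2, e2 = g2
    # f must be a bijection from the vertices of g1 onto the vertices of g2.
    if set(f) != set(v1) or set(f.values()) != set(v2):
        return False
    if len(set(f.values())) != len(f):
        return False
    # Vertex labels must be preserved.
    if any(v1[u] != v2[f[u]] for u in v1):
        return False
    # Every edge must map to an edge with the same label, and vice versa.
    mapped = {}
    for edge, label in e1.items():
        u, v = tuple(edge)
        mapped[frozenset((f[u], f[v]))] = label
    return mapped == e2


# Hypothetical toy graphs: a path X -a- Y -b- Z written with two vertex namings.
g1 = ({1: "X", 2: "Y", 3: "Z"},
      {frozenset((1, 2)): "a", frozenset((2, 3)): "b"})
g2 = ({"p": "Z", "q": "X", "r": "Y"},
      {frozenset(("q", "r")): "a", frozenset(("r", "p")): "b"})

good = is_isomorphism({1: "q", 2: "r", 3: "p"}, g1, g2)
bad = is_isomorphism({1: "q", 2: "p", 3: "r"}, g1, g2)
```

Checking one mapping is cheap; the hard part of graph isomorphism is finding such a mapping among all possible ones.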

Problem: Finding frequent subgraphs. Problem setting: similar to finding frequent itemsets for association rule discovery. Input: a database of graph transactions and a minimum support threshold. Each transaction is an undirected simple graph (no multiple edges) with labeled edges and vertices; transactions need not be connected. Output: the frequent subgraphs that satisfy the support threshold, where each frequent subgraph is connected.

Finding Frequent Subgraphs

Authors' contribution: Graphs are represented as strings (as in TreeMiner). No candidate generation: “It combines the growing and checking of frequent subgraphs into one procedure, thus accelerates the mining process.” gSpan is very fast and is still a standard baseline that most rival systems are compared against.

Concepts behind gSpan: The idea is to encode each graph as a depth-first search (DFS) code, a sequence with one entry per edge. Edges are sorted according to the lexicographic order of the codes. Yan and Han proved that graph isomorphism can be tested by comparing the DFS codes that annotate two graphs. Starting with small graph patterns containing one edge, patterns are expanded systematically by depth-first search. The anti-monotonic property of graph frequency is used to prune the search.

Anti-monotonicity of graph frequency: The frequency of a super-pattern is less than or equal to the frequency of any of its sub-patterns. Copyright SIGMOD'08
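To make frequency (support) and its anti-monotonicity concrete, here is a hedged Python sketch that counts in how many graph transactions a pattern occurs. It uses a brute-force subgraph test over all vertex assignments, which is only practical for tiny graphs and is not how gSpan counts support. The representation (vertex-label dict plus edge-label dict keyed by `frozenset`), the names `contains` and `support`, and the three-graph database are assumptions for illustration.

```python
from itertools import permutations


def contains(pattern, graph):
    """Brute-force test: does `graph` contain `pattern` as a labeled subgraph?"""
    p_vertices, p_edges = pattern
    g_vertices, g_edges = graph
    p_nodes = list(p_vertices)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_vertices, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if any(p_vertices[u] != g_vertices[f[u]] for u in p_nodes):
            continue
        ok = True
        for edge, label in p_edges.items():
            u, v = tuple(edge)
            # The mapped edge must exist in the graph with the same label.
            if g_edges.get(frozenset((f[u], f[v]))) != label:
                ok = False
                break
        if ok:
            return True
    return False


def support(pattern, database):
    """Number of graph transactions that contain the pattern."""
    return sum(contains(pattern, g) for g in database)


def fs(u, v):
    return frozenset((u, v))


# Hypothetical database of three small graph transactions.
db = [
    ({0: "X", 1: "Y", 2: "Z"}, {fs(0, 1): "a", fs(1, 2): "b"}),
    ({0: "X", 1: "Y"}, {fs(0, 1): "a"}),
    ({0: "X", 1: "Z"}, {fs(0, 1): "a"}),
]
sub_pattern = ({"u": "X", "v": "Y"}, {fs("u", "v"): "a"})
super_pattern = ({"u": "X", "v": "Y", "w": "Z"},
                 {fs("u", "v"): "a", fs("v", "w"): "b"})

sub_support = support(sub_pattern, db)
super_support = support(super_pattern, db)
```

On this toy database the one-edge pattern occurs in two transactions and its two-edge extension in one, so with a minimum support of 2 the extension, and everything grown from it, can be discarded.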

Lexicographic ordering in graphs: It tells us the order of any two graphs. The design helps us build a similar hierarchy. It should make it easy to grow from one level to the next lower level and to roll up from a lower level to a higher one. For graphs, it may be difficult to design the hierarchy so that no two nodes in the tree are the same. The ordering can tell us whether a graph has already been discovered. Most importantly, if a graph has been discovered, all its child nodes in the hierarchy must have been discovered as well.

Lexicographic ordering in graphs: [Figure: search hierarchy with levels for 1-edge, 2-edge, and 3-edge graphs]

DFS code and minimum DFS code: We use a 5-tuple (vi, vj, L(vi), L(vj), L(vi,vj)) to represent an edge. (It may be redundant, but it is much easier to understand.) A graph is turned into a sequence whose basic element is this 5-tuple. The sequence is formed in the following order: (1) to extend the graph by one new node, add the forward edge that connects a node in the old graph with the new node; (2) add all backward edges that connect the new node to other nodes in the old graph; (3) repeat this procedure.

DFS code: [Figure: example graph with vertices v0 to v4 and its DFS tree] e0: (0,1,x,y,a); e1: (1,2,y,x,b); e2: (2,0,x,x,a); e3: (2,3,x,z,c); e4: (3,1,z,y,b); e5: (1,4,y,z,d)
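The edge list above can be regenerated from the vertex labels and the edge order, which also makes the forward/backward distinction explicit. This is a small Python sketch under stated assumptions: the function name `dfs_code` and the input layout are hypothetical, and the vertex labels (v0 = x, v1 = y, v2 = x, v3 = z, v4 = z) are the ones implied by edges e0 to e3 and by the endpoint v4, with each vertex carrying exactly one label.

```python
def dfs_code(vertex_labels, edge_sequence):
    """Build 5-tuples (i, j, L(vi), L(vj), L(vi,vj)) for edges given in DFS order.

    vertex_labels: list indexed by DFS discovery index.
    edge_sequence: list of (i, j, edge_label) in the order the edges are emitted.
    An edge with i < j discovers a new vertex (forward); i > j closes a cycle
    back to an already discovered vertex (backward).
    """
    code = []
    for i, j, edge_label in edge_sequence:
        kind = "forward" if i < j else "backward"
        entry = (i, j, vertex_labels[i], vertex_labels[j], edge_label)
        code.append((entry, kind))
    return code


# Vertex labels in discovery order v0..v4 (assumed from the example edges).
labels = ["x", "y", "x", "z", "z"]
edges = [(0, 1, "a"), (1, 2, "b"), (2, 0, "a"),
         (2, 3, "c"), (3, 1, "b"), (1, 4, "d")]

code = dfs_code(labels, edges)
for n, (entry, kind) in enumerate(code):
    print(f"e{n}: {entry} {kind}")
```

Deriving each tuple from a single label per vertex keeps the code consistent: e4 starts at v3 (label z) and ends at v1 (label y), and e5 starts at v1 (label y).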

DFS code and minimum DFS code: depth-first tree and forward/backward edge sets

Minimum DFS code: A graph may have many DFS codes, because each choice of start vertex and visiting order yields a different depth-first traversal. The lexicographically smallest one is its minimum DFS code.

Graph parent and its children: Given a DFS code c0 = (e0, e1, …, en) and a DFS code c1 = (e0, e1, …, en, ex) with c0 < c1, c0 is c1's parent and c1 is c0's child.

DFS code tree: [Figure: DFS code tree with levels for 1-edge, 2-edge, and 3-edge graphs]

Theorem:
1. Given two graphs G0 and G1, G0 is isomorphic to G1 iff min_dfs_code(G0) = min_dfs_code(G1).
2. The DFS code tree covers all graphs, although some tree nodes may represent the same graph.
3. Given a node in the DFS code tree, if its DFS code is not its minimum DFS code, pruning this node and all its descendants does not change the "covering": the tree still covers all graphs.
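The first statement of the theorem can be tried out on small graphs with a brute-force Python sketch that enumerates every DFS code of a graph and keeps the smallest. Two loud assumptions: the codes are compared with Python's plain tuple ordering, a simplified stand-in for the DFS lexicographic order defined by Yan and Han, and the enumeration is exponential, whereas gSpan itself never enumerates all codes. The names `dfs_codes` and `min_dfs_code` and the toy graphs are hypothetical. Any fixed total order over the full set of DFS codes yields a canonical form, so the isomorphism test still holds for this simplified version.

```python
def dfs_codes(vlabels, edges):
    """Yield every DFS code of a connected, undirected, simple labeled graph.

    vlabels: {vertex: label}; edges: {frozenset({u, v}): label}.
    """
    adj = {v: set() for v in vlabels}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    def grow(path, index, used, code):
        # path: current DFS stack; index: vertex -> discovery index.
        if len(used) == len(edges):
            yield tuple(code)
            return
        if not path:
            return  # only reachable if the graph is not connected
        top = path[-1]
        fresh = [w for w in adj[top] if w not in index]
        if not fresh:
            # Nothing new to discover from here: backtrack.
            yield from grow(path[:-1], index, used, code)
            return
        for w in fresh:
            new_index = {**index, w: len(index)}
            # Forward edge that discovers w.
            new_code = code + [(index[top], new_index[w], vlabels[top],
                                vlabels[w], edges[frozenset((top, w))])]
            new_used = used | {frozenset((top, w))}
            # Backward edges from w to already discovered vertices.
            backs = sorted((u for u in adj[w] if u in index and u != top),
                           key=index.get)
            for u in backs:
                new_code.append((new_index[w], index[u], vlabels[w],
                                 vlabels[u], edges[frozenset((w, u))]))
                new_used = new_used | {frozenset((w, u))}
            yield from grow(path + [w], new_index, new_used, new_code)

    for start in vlabels:
        yield from grow([start], {start: 0}, frozenset(), [])


def min_dfs_code(graph):
    """Smallest DFS code under plain tuple ordering (simplified order)."""
    return min(dfs_codes(*graph))


# Hypothetical graphs: g0 and g1 are the same path with different vertex names,
# g2 differs from g1 in one edge label.
g0 = ({"a": "X", "b": "Y", "c": "Z"},
      {frozenset(("a", "b")): "p", frozenset(("b", "c")): "q"})
g1 = ({1: "Z", 2: "Y", 3: "X"},
      {frozenset((1, 2)): "q", frozenset((2, 3)): "p"})
g2 = ({1: "Z", 2: "Y", 3: "X"},
      {frozenset((1, 2)): "q", frozenset((2, 3)): "q"})

same = min_dfs_code(g0) == min_dfs_code(g1)
different = min_dfs_code(g0) != min_dfs_code(g2)
```

The same function gives the pruning test from the third statement: a code found during the search can be dropped when it differs from the minimum code of the graph it describes.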

Algorithm


Experimental Result


Conclusion: No candidate generation and no false test. Space saving from depth-first search. Good performance: with a "memory pool" and one major counting improvement, it seems that performance could improve by a further factor of 5 (but this needs more testing).

Exam questions: Q1) What two major costs of Apriori-like frequent substructure mining algorithms did gSpan aim to reduce or avoid? Answer: 1) Creating size-(k+1) candidate subgraphs from size-k frequent subgraphs is more complicated and costly than standard Apriori large-itemset generation. 2) Pruning false positives is an expensive process, because the subgraph isomorphism problem is NP-complete.

Exam Questions (cont.) Q2) Which DFS Tree Does The DFS Code Below Belong To?

Answer: tree (c). [Figure: DFS tree (c) over vertices v0 to v4]

Exam questions: Q3) What does gSpan compare when testing for isomorphism between two graphs, and why? Answer: gSpan compares the minimum DFS codes of the two graphs. Given two graphs G and G', G is isomorphic to G' if and only if min(G) = min(G'). This theorem reduces the comparison of complicated graphs to a simple string comparison. If two nodes of the DFS code tree contain the same graph but different DFS codes, we can prune the sub-branch of the rightmost of the two nodes, whose code is not the minimum. This greatly decreases the problem size.

Questions?