1
Mining in Graphs and Complex Structures
01 Background 02 gSpan 03 Improvement 04 Application
Presented by: Yi He
2
01 Background
1 Why we need graph mining
2 What frequent subgraphs are
3 Approaches to graph mining
4 Two challenges
3
Background: Why Graph Mining?
Graphs are prevalent: chemical compounds; protein structures and biological networks; program control flow, traffic flow, and workflow analysis; social network analysis.
A graph is a general model: trees, lattices, sequences, and items are degenerate graphs.
Diversity of graphs: directed vs. undirected; labeled vs. unlabeled (edges and vertices); weighted, with angles and geometry (topological vs. 2-D/3-D).
Complexity of algorithms: many problems are of high complexity!
A graph is worth a thousand words, e.g., One Hundred Years of Solitude by García Márquez.
4
01 Background
Example graphs: aspirin, yeast protein interaction network, the Internet, a social network.
5
Frequent subgraphs: How to find?
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold (minsup). Example: hydroxide ion.
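The support definition above can be sketched as a brute-force count in Python. The graph representation (a vertex-label dictionary plus an edge-label dictionary) and the names `make_graph`, `contains`, and `support` are assumptions made for illustration; the exhaustive matching is only practical for tiny graphs and is not how gSpan itself counts support.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: {vertex: label}; edges: iterable of (u, v, edge_label), undirected."""
    return labels, {frozenset((u, v)): lab for u, v, lab in edges}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a labeled (non-induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_labels)):
        m = dict(zip(p_labels, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_labels)
        edges_ok = all(
            g_edges.get(frozenset(m[v] for v in e)) == lab
            for e, lab in p_edges.items()
        )
        if labels_ok and edges_ok:
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

# Hypothetical toy dataset of three small labeled graphs.
dataset = [
    make_graph({0: "C", 1: "O", 2: "H"}, [(0, 1, "-"), (1, 2, "-")]),
    make_graph({0: "C", 1: "O"}, [(0, 1, "-")]),
    make_graph({0: "C", 1: "C"}, [(0, 1, "-")]),
]
c_o = make_graph({0: "C", 1: "O"}, [(0, 1, "-")])
o_h = make_graph({0: "O", 1: "H"}, [(0, 1, "-")])
minsup = 2
print(support(dataset, c_o) >= minsup, support(dataset, o_h) >= minsup)
```

With minsup = 2, the C-O pattern counts as frequent in this toy dataset while the O-H pattern does not.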
6
Frequent Subgraph: Mining Approaches
Apriori-based approach: AGM/AcGM: Inokuchi, et al. (PKDD’00); FSG: Kuramochi and Karypis (ICDM’01); PATH: Vanetik and Gudes (ICDM’02, ICDM’04); FFSM: Huan, et al. (ICDM’03).
Pattern growth-based approach: MoFa: Borgelt and Berthold (ICDM’02); gSpan: Yan and Han (ICDM’02); Gaston: Nijssen and Kok (KDD’04).
The approaches differ in search order (Apriori vs. pattern growth) and in how they eliminate duplicate subgraphs (passive vs. active).
The task here is the simpler one: mining a set of different graphs (graph transactions) for frequent substructures, rather than mining one whole large structure.
7
Frequent Subgraph: Challenges
Pattern growth = generating longer and longer patterns.
How to order the candidates and generate them efficiently, pruning some at an early stage.
How to identify duplicated candidates.
How to reduce the exponential time cost induced by candidate generation.
How to reduce the exponential time cost induced by the isomorphism test.
8
Frequent subgraphs: Isomorphism Test
The kernel of the problem is the subgraph isomorphism test, which is NP-complete: no polynomial-time algorithm for it is known. Reference point X in graph (a). Frequent subgraphs carry information.
9
02 Introduction
Depth First Search; DFS Code; DFS Lexicographic Order vs. Minimum DFS Code; DFS Code Tree & Rightmost Path
10
Depth First Search: 02 Introduction
11
02 Introduction DFS Code DFS Lexicographic Order VS Minimum DFS Code
DFS Code Tree & Rightmost Path Depth First Search & Rightmost Path
12
DFS Code: 02 Introduction
13
DFS code: Edge Order 02 Introduction
15
DFS Lexicographic Ordering:
→ Different search starting points → multiple DFS codes for the same graph
16
DFS Lexicographic Ordering
α is the code (edge sequence) containing edge a_t; β is the code containing edge b_t. B = backward edges, F = forward edges.
17
02 Introduction
19
Rightmost Path for Extension
When building a DFS code, all backward edges from a vertex must be added before any forward edge from it. The rightmost path is the shortest path between v0 and vn using only forward edges.
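The rightmost path can be sketched in a few lines of Python. This assumes a DFS code is stored as a list of tuples whose first two fields are the DFS discovery indices (i, j) of an edge, with i < j marking a forward edge and i > j a backward edge; the tuple layout, the function name `rightmost_path`, and the labels in the example are illustrative assumptions, not taken from the slides.

```python
def rightmost_path(dfs_code):
    """Return the vertex indices on the rightmost path, from v0 to the rightmost vertex.

    dfs_code: list of tuples (i, j, ...) where i < j means a forward edge
    (j is newly discovered from i) and i > j means a backward edge.
    """
    parent = {}
    for i, j, *_ in dfs_code:
        if i < j:  # forward edge: i is the DFS-tree parent of j
            parent[j] = i
    # The rightmost vertex is the last one discovered, i.e. the largest index.
    v = max(parent, default=0)
    path = [v]
    while v in parent:  # climb forward edges back to the root v0
        v = parent[v]
        path.append(v)
    return path[::-1]

# Hypothetical DFS code: tuples are (i, j, label_i, edge_label, label_j).
code = [
    (0, 1, "X", "a", "Y"),  # forward
    (1, 2, "Y", "b", "X"),  # forward
    (2, 0, "X", "a", "X"),  # backward
    (2, 3, "X", "c", "Z"),  # forward
    (3, 1, "Z", "b", "Y"),  # backward
    (1, 4, "Y", "d", "Z"),  # forward
]
print(rightmost_path(code))
```

In this example the rightmost vertex is 4 and the path back to the root runs through vertex 1, so backward edges may only start at vertex 4 and forward edges may only start at vertices 0, 1, or 4.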
20
Rightmost Path (completeness)
[Figure: example graph with vertex labels X, Y, Z and edge labels a, b, c] The rightmost path is the shortest path between v0 and vn using only forward edges.
21
gSpan Algorithm: 02 Introduction Mining Result Dataset
When two candidates represent the same graph, we keep only the one with the minimum DFS code.
22
DFS Code Tree 02 Introduction
Minsup = 3. When two nodes represent the same graph, we keep the one whose DFS code comes first in DFS lexicographic order. Given two graphs G and G’, G is isomorphic to G’ iff min(G) = min(G’).
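The idea of testing isomorphism by comparing canonical codes can be illustrated with a much simpler stand-in for the minimum DFS code. The sketch below does not compute DFS codes: it takes the smallest encoding over all vertex renumberings, which is exponential and only suitable for tiny graphs. The function name `canonical_form` and the example graphs are assumptions made for illustration.

```python
from itertools import permutations

def canonical_form(labels, edges):
    """Brute-force canonical label of a small undirected labeled graph.

    labels: {vertex: label}; edges: list of (u, v, edge_label).
    Two graphs are isomorphic exactly when their canonical forms are equal.
    """
    nodes = list(labels)
    best = None
    for perm in permutations(range(len(nodes))):
        new_id = dict(zip(nodes, perm))  # one candidate renumbering
        vertex_part = tuple(
            lab for _, lab in sorted((new_id[v], labels[v]) for v in nodes)
        )
        edge_part = tuple(sorted(
            (min(new_id[u], new_id[v]), max(new_id[u], new_id[v]), lab)
            for u, v, lab in edges
        ))
        code = (vertex_part, edge_part)
        if best is None or code < best:
            best = code
    return best

# Same labeled triangle under two different vertex namings, plus a path.
g_a = ({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b"), (2, 0, "c")])
g_b = ({"p": "Z", "q": "X", "r": "Y"}, [("q", "r", "a"), ("r", "p", "b"), ("p", "q", "c")])
g_c = ({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b")])

print(canonical_form(*g_a) == canonical_form(*g_b))
print(canonical_form(*g_a) == canonical_form(*g_c))
```

gSpan gets the same effect far more cheaply because the minimum DFS code is built edge by edge during the search, so non-minimal codes can be discarded without enumerating renumberings.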
23
Depth First Search: Convert Graphs into Strings
24
Experimental Result:
Parameters: size of frequent kernels I; number of possible labels N; D and L fixed.
25
03 Improvement: FSG (2001), gSpan (2002), CloseGraph (2003), gPrune (2007)
26
Closed Frequent Graphs:
Motivation: handling the graph pattern explosion problem. Mining all frequent graphs first and cutting them down afterwards is not efficient: a frequent graph with 100 edges can have on the order of 2^100 frequent subgraphs for gSpan to enumerate.
Closed frequent graph: a frequent graph G is closed if there exists no supergraph of G that carries the same support as G.
If some of G’s subgraphs have the same support as G, it is unnecessary to output these subgraphs (non-closed graphs). When a subgraph G and its supergraph G’ always have the same support, only G’ needs to be output.
Lossless compression: the mining result is still complete.
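The closedness definition can be sketched as a post-filter over already-mined patterns. To keep the example short, it assumes each pattern is a frozenset of labeled edges over shared vertex names, so that "subgraph of" reduces to a proper-subset test; a real implementation needs a subgraph isomorphism check instead. The name `closed_patterns` and the toy supports are illustrative assumptions. Note that this is exactly the "mine everything, then cut" approach that CloseGraph is designed to avoid.

```python
def closed_patterns(supports):
    """Keep only patterns with no proper superpattern of equal support.

    supports: {frozenset_of_edges: support_count}. The subset test stands in
    for a real subgraph test (simplifying assumption).
    """
    return {
        p: s
        for p, s in supports.items()
        if not any(p < q and s == t for q, t in supports.items())
    }

# Hypothetical mined patterns with their supports.
ab = ("A", "B")
bc = ("B", "C")
cd = ("C", "D")
supports = {
    frozenset([ab]): 3,          # same support as its superpattern: not closed
    frozenset([ab, bc]): 3,      # closed
    frozenset([ab, bc, cd]): 2,  # closed (no superpattern at all)
}
closed = closed_patterns(supports)
print(sorted(len(p) for p in closed))
```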
27
CLOSEGRAPH (Yan & Han, KDD’03)
Improvement
28
CLOSEGRAPH: Performance Improvement
29
gPrune: Constraint-Based Graph Pattern Mining
There are often various kinds of constraints (from domain knowledge) specified for mining a graph pattern P, e.g., max_degree(P) ≥ 10 or diameter(P) ≥ δ.
Main idea: push constraints deep into the mining process to reduce the search space. Constraints can be classified into different categories, and different categories require different pushing strategies.
30
gPrune: Pattern Pruning vs. Data Pruning
Pattern pruning: while generating sub-patterns, if a pattern does not satisfy the constraint and cannot satisfy it however it keeps growing, cut it.
Data pruning: divide the dataset into smaller sets and cut off the sets that have no way to satisfy the constraint.
31
Pattern Pruning vs. Data Pruning
gPrune: Pattern Pruning vs. Data Pruning Improvement
32
gPrune: Pruning Pattern Search Space
Density ratio = existing edges / possible edges. If a large graph meets the density-ratio requirement, at least one of its smaller subgraphs meets it as well.
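The density-ratio claim can be checked on a small example. The sketch below assumes simple undirected graphs, takes "possible edges" to be n(n-1)/2, and looks only at subgraphs obtained by deleting one vertex, ignoring any connectivity requirement on patterns; the function names and the example graph are hypothetical. Averaging the densities of all vertex-deleted subgraphs gives back the density of the original graph, so the best of them is never less dense than the original.

```python
def density_ratio(vertices, edges):
    """Existing edges divided by possible edges in a simple undirected graph."""
    n = len(vertices)
    possible = n * (n - 1) // 2
    return len(edges) / possible if possible else 0.0

def best_vertex_deleted_density(vertices, edges):
    """Highest density ratio among the subgraphs obtained by deleting one vertex."""
    return max(
        density_ratio(vertices - {v}, {e for e in edges if v not in e})
        for v in vertices
    )

# Hypothetical graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
vertices = {0, 1, 2, 3}
edges = {frozenset(p) for p in [(0, 1), (1, 2), (2, 0), (2, 3)]}

original = density_ratio(vertices, edges)
smaller = best_vertex_deleted_density(vertices, edges)
print(original, smaller, smaller >= original)
```

Here deleting the pendant vertex leaves the triangle, whose density ratio is at least that of the full graph.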
33
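gPrune: Pruning Data Space
Separable: if what the pattern already has plus what the rest of the graph can still contribute is less than 10, then prune. Why in the middle of mining? Find the frequent (useful) part first, then cut the components that are not useful once we know they cannot be sufficient.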
34
gPrune: Pruning Data Space Improvement
Always look back at the dataset: as soon as the constraint can no longer be satisfied, cut.
35
gPrune: Graph Constraints-- A General Picture
No single pruning technique is efficient all the time, but for any given constraint at least one of the knives is useful.
36
A traffic mining approach:
Old approach: choose the path by Euclidean distance. New approach: also account for high-crime areas, construction, traffic jams, and high crash rates. The suggested path contains as many frequent path segments as possible.
Components: road hierarchy-based partitioning, speed rule mining, driving pattern mining, adaptive precomputation, road upgrading, adaptive fastest-path algorithms.
37
Exam Questions: Q1) What does gSpan compare when testing for isomorphism between two graphs, and why? Answer: gSpan compares the minimum DFS codes of the two graphs. Given two graphs G and G’, G is isomorphic to G’ iff min(G) = min(G’). This theorem reduces the isomorphism test on complicated graphs to a simple string comparison. If two nodes in the DFS code tree contain the same graph but different DFS codes, we can prune the sub-branch of the rightmost of the two nodes. This greatly decreases the problem size.
38
Exam Questions: Q2) Which pattern does the DFS code below belong to?
Answer: tree (c)
39
Exam Questions: Q3) What is the main idea of CloseGraph mining? Answer: To reduce the search space through early termination. Suppose G and G’ are frequent and G is a subgraph of G’. If G’ occurs wherever G occurs in every graph of the dataset, so that G and G’ share the same support, then we stop and do not need to grow G, since none of G’s children will be closed except those of G’.
40
THANK YOU! Presented by Yi He