Mining in Graphs and Complex Structures

Presented by: Yi He
Outline: 01 Background, 02 gSpan, 03 Improvement, 04 Application

01 Background
1. Why we need graph mining
2. What frequent subgraphs are
3. Approaches to graph mining
4. Two challenges

Background: Why Graph Mining?
Graphs are prevalent: chemical compounds; protein structures and biological networks; program control flow, traffic flow, and workflow analysis; social network analysis.
A graph is a general model: trees, lattices, sequences, and items are degenerate graphs.
Diversity of graphs: directed vs. undirected; labeled vs. unlabeled (edges & vertices); weighted; with angles & geometry (topological vs. 2-D/3-D).
Complexity of algorithms: many problems are of high complexity!
A graph is worth a thousand words. Example: One Hundred Years of Solitude by García Márquez.

Examples of graphs: aspirin, yeast protein interaction network, Internet, social network.

Frequent subgraphs: How to find?
A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold (minsup).
Example pattern: hydroxide ion.
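
The support definition above can be illustrated with a minimal Python sketch. It assumes that support means the number of dataset graphs containing the pattern at least once, and it uses a brute-force containment test that is only practical for tiny graphs. The helper names (`make_graph`, `contains`, `support`) and the three toy molecules are hypothetical, not taken from the slides.

```python
from itertools import permutations

def make_graph(vertex_labels, labeled_edges):
    """vertex_labels: {vertex: label}; labeled_edges: [(u, v, label), ...], undirected."""
    edges = {frozenset((u, v)): lab for u, v, lab in labeled_edges}
    return vertex_labels, edges

def contains(graph, pattern):
    """Brute force: try every injective mapping of pattern vertices into the graph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Vertex labels must agree under the mapping.
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue
        # Every pattern edge must exist in the graph with the same edge label.
        if all(g_edges.get(frozenset(mapping[v] for v in edge)) == lab
               for edge, lab in p_edges.items()):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

# Toy dataset (hypothetical): vertex labels are atoms, edge labels are bond types.
dataset = [
    make_graph({0: "C", 1: "O", 2: "H"}, [(0, 1, "-"), (1, 2, "-")]),
    make_graph({0: "C", 1: "O"}, [(0, 1, "=")]),
    make_graph({0: "O", 1: "H"}, [(0, 1, "-")]),
]
oh_pattern = make_graph({"a": "O", "b": "H"}, [("a", "b", "-")])

minsup = 2
is_frequent = support(dataset, oh_pattern) >= minsup
```

With this toy data the O-H pattern occurs in the first and third graphs, so it is frequent for minsup = 2. The containment test is exactly the expensive step that the later slides try to avoid.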

Frequent Subgraph: Mining Approaches
Apriori-based approach:
AGM/AcGM: Inokuchi, et al. (PKDD’00)
FSG: Kuramochi and Karypis (ICDM’01)
PATH: Vanetik and Gudes (ICDM’02, ICDM’04)
FFSM: Huan, et al. (ICDM’03)
Pattern growth-based approach:
MoFa: Borgelt and Berthold (ICDM’02)
gSpan: Yan and Han (ICDM’02)
Gaston: Nijssen and Kok (KDD’04)
The approaches differ in search order (apriori vs. pattern growth) and in how duplicate subgraphs are eliminated (passive vs. active).
Here we do the simpler job: mining across a set of different graphs (graph transactions) to find frequent subgraphs, rather than mining one whole structure.

Frequent Subgraph: Challenges
Pattern growth means generating longer and longer patterns.
How to order and generate patterns efficiently, truncating some branches at an early stage.
How to identify the duplicated ones.
How to reduce the exponential time cost induced by candidate generation.
How to reduce the exponential time cost induced by the isomorphism test.

Frequent subgraphs: Isomorphism Test
The kernel of the problem, the subgraph isomorphism test, is NP-complete: no polynomial-time algorithm is known for it.
[Figure: reference point X in graph (a)]
Frequent subgraphs contain information.

02 Introduction
DFS Code
DFS Lexicographic Order vs. Minimum DFS Code
Depth First Search
DFS Code Tree & Rightmost Path

Depth First Search

DFS Code Tree & Rightmost Path: Depth First Search & Rightmost Path

DFS Code

DFS Code: Edge Order

DFS Lexicographic Ordering:
Different search starting points → multiple DFS codes for the same graph.

DFS Lexicographic Ordering
α is the DFS code made up of edges a_t; β is the DFS code made up of edges b_t.
B = backward edges, F = forward edges.

Rightmost Path for Extension
When building DFS codes, all backward edges must be expanded first!
Rightmost path: the path between V0 and Vn that uses only forward edges (in the DFS tree this path is unique).

Rightmost Path (completeness)
[Figure: example graph with labels X, Y, Z and a, b, c]
Rightmost path: the path between V0 and Vn that uses only forward edges.
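
As a rough illustration of the rightmost path, the sketch below recovers it from a DFS code. It assumes the common representation of a DFS code as a list of 5-tuples (i, j, label_i, edge_label, label_j), where an edge is a forward edge when i < j and a backward edge otherwise. The function name and the example code are illustrative assumptions, not taken from the slides.

```python
def rightmost_path(dfs_code):
    """Return the vertices on the path from V0 to the rightmost vertex Vn,
    following forward edges only."""
    parent = {}
    for i, j, *_labels in dfs_code:
        if i < j:            # forward edge: j is discovered from i
            parent[j] = i
    if not parent:           # no forward edge yet: only the root exists
        return [0]
    v = max(parent)          # rightmost vertex = last discovered vertex
    path = [v]
    while v in parent:       # climb towards the root V0
        v = parent[v]
        path.append(v)
    return path[::-1]

# Illustrative DFS code: (i, j, label_i, edge_label, label_j)
code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),   # backward edge
    (2, 3, "X", "c", "Z"),
    (3, 1, "Z", "b", "Y"),   # backward edge
    (1, 4, "Y", "d", "Z"),
]
path_full = rightmost_path(code)        # after all six edges
path_before = rightmost_path(code[:5])  # before vertex 4 is added
```

In this sketch, new forward edges would only be attached to vertices on the returned path, and backward edges only start from its last vertex, which is what keeps the extension step small.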

gSpan Algorithm:
[Figure: dataset and mining result]
Since the codes describe the same graph, we select the one that is the minimum DFS code.

DFS Code Tree
Minsup = 3
Since the two nodes represent the same graph, we keep the one whose DFS code comes first in the ordering.
Given two graphs G and G’, G is isomorphic to G’ iff min(G) = min(G’).
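
The idea that two graphs are isomorphic exactly when their canonical codes are equal can be sketched in a few lines. Note the substitution: this is not gSpan's minimum DFS code but a simpler brute-force canonical form (the smallest relabeled vertex and edge list over all vertex orderings), used here only to show how an isomorphism test turns into a plain comparison. The function name and toy graphs are hypothetical.

```python
from itertools import permutations

def canonical_code(labels, edges):
    """labels: {vertex: label}; edges: [(u, v, edge_label), ...], undirected.
    Returns the lexicographically smallest encoding over all vertex orderings.
    Exponential in the number of vertices: for tiny graphs only."""
    nodes = list(labels)
    best = None
    for perm in permutations(range(len(nodes))):
        idx = dict(zip(nodes, perm))
        # Vertex labels listed in the order given by the new numbering.
        vertex_part = tuple(lab for _, lab in
                            sorted((idx[v], labels[v]) for v in nodes))
        # Edges rewritten with the new numbering, smaller endpoint first.
        edge_part = tuple(sorted(
            (min(idx[u], idx[v]), max(idx[u], idx[v]), lab)
            for u, v, lab in edges))
        code = (vertex_part, edge_part)
        if best is None or code < best:
            best = code
    return best

# g1 and g2 are the same labeled path X -p- Y -q- Z with different vertex names.
g1 = ({"a": "X", "b": "Y", "c": "Z"}, [("a", "b", "p"), ("b", "c", "q")])
g2 = ({1: "Z", 2: "Y", 3: "X"}, [(2, 1, "q"), (3, 2, "p")])
# g3 swaps the two edge labels, so it is a different graph.
g3 = ({"a": "X", "b": "Y", "c": "Z"}, [("a", "b", "q"), ("b", "c", "p")])

same = canonical_code(*g1) == canonical_code(*g2)
different = canonical_code(*g1) != canonical_code(*g3)
```

gSpan gets the same effect far more cheaply, because the minimum DFS code is built along the search itself instead of by enumerating all orderings.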

Depth First Search: Convert Graphs into Strings

Experimental Result:
[Figure: I = size of frequent kernel, N = possible labels; D and L fixed]

03 Improvement
Timeline: FSG (2001), gSpan (2002), CloseGraph (2003), gPrune (2007)

Closed Frequent Graphs:
Motivation: handling the graph pattern explosion problem.
Closed frequent graph: a frequent graph G is closed if there exists no super-graph of G that carries the same support as G.
If some of G’s subgraphs have the same support as G, it is unnecessary to output these subgraphs (non-closed graphs).
Lossless compression: it still ensures that the mining result is complete.
Non-closed case: G and its super-graph G’ always have the same support.
Mining all the frequent graphs first and then cutting the non-closed ones is not efficient: with gSpan the number of patterns to enumerate can be on the order of 2^100.
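
The closedness definition can be made concrete with a small post-filter. The sketch assumes a strong simplification: each pattern is a frozenset of distinct labeled edges, so "super-graph" reduces to a proper-superset test. Real graph patterns need a subgraph isomorphism test instead. The pattern contents and supports are made up for illustration.

```python
def closed_patterns(supports):
    """supports: {pattern (frozenset of labeled edges): support}.
    Keep a pattern only if no proper super-pattern has the same support."""
    closed = {}
    for p, s in supports.items():
        has_equal_super = any(p < q and s == t for q, t in supports.items())
        if not has_equal_super:
            closed[p] = s
    return closed

# Hypothetical labeled edges and supports.
e1, e2, e3 = ("C", "-", "O"), ("O", "-", "H"), ("C", "-", "N")
supports = {
    frozenset({e1}): 3,            # not closed: {e1, e2} also has support 3
    frozenset({e2}): 4,            # closed: every super-pattern has lower support
    frozenset({e1, e2}): 3,        # closed
    frozenset({e1, e2, e3}): 2,    # closed (no super-pattern at all)
}
result = closed_patterns(supports)
```

This filter is the "mine everything, then cut" strategy that the slide calls inefficient; it only shows which patterns count as closed, not how to find them quickly.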

CLOSEGRAPH (Yan & Han, KDD’03)

CLOSEGRAPH: Performance Improvement

gPrune: Constraint-Based Graph Pattern Mining
Main idea: push constraints deeply into the mining process to reduce the search space. Constraints can be classified into different categories, and different categories require different pushing strategies.
There are often various kinds of constraints (with domain knowledge) specified for mining a graph pattern P, e.g., max_degree(P) ≥ 10, diameter(P) ≥ δ.
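
To make the two example constraints concrete, here is a minimal sketch that evaluates max_degree(P) and diameter(P) for a small pattern. It assumes P is a non-empty, connected, undirected graph given as a list of vertex pairs. The function names and the `satisfies` helper are illustrative, not gPrune's interface.

```python
from collections import deque

def adjacency(edges):
    """Build an adjacency map from undirected (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def max_degree(edges):
    return max(len(nbrs) for nbrs in adjacency(edges).values())

def diameter(edges):
    """Longest shortest-path distance, via one BFS per vertex
    (assumes a connected pattern)."""
    adj = adjacency(edges)
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

def satisfies(edges, min_max_degree, min_diameter):
    """Check max_degree(P) >= min_max_degree and diameter(P) >= min_diameter."""
    return max_degree(edges) >= min_max_degree and diameter(edges) >= min_diameter

path = [(0, 1), (1, 2), (2, 3)]   # a path on four vertices
star = [(0, 1), (0, 2), (0, 3)]   # a star with centre 0
```

Checking a finished pattern like this is the easy part; the point of gPrune is to use such constraints during mining, before the pattern is fully grown.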

gPrune: Pattern Pruning vs. Data Pruning
Pattern pruning: generate the sub-pattern; if it does not satisfy the constraint and there is no way for it to satisfy the constraint as it keeps growing, cut it.
Data pruning: divide the data set into smaller sets and cut off the sets that have no way to satisfy the constraint.


gPrune: Pruning Pattern Search Space
Density ratio = existing edges / possible edges.
If a big graph meets the density-ratio requirement, at least one smaller subgraph of it will also meet the requirement.
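
A small worked example of the density ratio: the sketch assumes a simple undirected graph, so the number of possible edges on n vertices is n(n-1)/2. It also shows, on one toy graph, that dropping a minimum-degree vertex does not lower the ratio. The helper names and the graph are hypothetical.

```python
def density_ratio(num_vertices, edges):
    """Existing edges divided by possible edges n*(n-1)/2 (simple undirected graph)."""
    possible = num_vertices * (num_vertices - 1) / 2
    return len(edges) / possible if possible else 0.0

def drop_min_degree_vertex(vertices, edges):
    """Remove one vertex of minimum degree together with its incident edges."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    victim = min(vertices, key=lambda v: degree[v])
    kept_vertices = [v for v in vertices if v != victim]
    kept_edges = [(u, v) for u, v in edges if victim not in (u, v)]
    return kept_vertices, kept_edges

vertices = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # a triangle plus a pendant vertex

big = density_ratio(len(vertices), edges)  # 4 of 6 possible edges
sub_vertices, sub_edges = drop_min_degree_vertex(vertices, edges)
small = density_ratio(len(sub_vertices), sub_edges)  # 3 of 3 possible edges
```

Here the smaller subgraph is the triangle, whose ratio is higher than that of the full graph, so any density threshold met by the big graph is also met by this smaller one.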

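gPrune: Pruning Data Space
Separable: if what already exists plus what the rest can still contribute is less than 10, truncate.
Why prune in the middle of mining? Once the frequent (useful) part is found, the components that are not useful can be cut, because we know they are not sufficient to satisfy the constraint.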

gPrune: Pruning Data Space
Always look back to the dataset: as soon as the constraint cannot be satisfied, cut.

gPrune: Graph Constraints: A General Picture
No single pruning technique is efficient all the time, but at least one knife (pruning technique) is useful.

A traffic mining approach:
Old approach to traffic: Euclidean space distance.
New approach: take into account high crime, construction, traffic jams, and high crash rates.
Suggested paths contain as many frequent path segments as possible.
* Road hierarchy-based partitioning
* Speed rule mining
Driving pattern mining
Adaptive pre-computation
Road upgrading
Adaptive fastest path algorithms

Exam Questions: Q1) What does gSpan compare when testing for isomorphism between two graphs, and why?
Answer: gSpan compares the minimum DFS codes of the two graphs. Given two graphs G and G’, G is isomorphic to G’ iff min(G) = min(G’). This theorem allows a simple string comparison to stand in for a comparison of more complicated graphs. If two nodes of the DFS code tree contain the same graph but different DFS codes, we can prune the sub-branch of the rightmost of the two nodes, whose code is not the minimum. This greatly decreases the problem size.

Exam Questions: Q2) Which pattern does the DFS code below belong to?
Answer: tree (c)

Exam Questions: Q3) What is the main idea of CloseGraph mining?
Answer: To reduce the search space by finding an “early termination”. Suppose G and G’ are frequent and G is a subgraph of G’. If G’ also occurs in every part of every graph in the dataset where G occurs, that is, G and G’ share the same support, then we stop here and do not need to grow G, since none of G’s children will be closed except those of G’.

THANK YOU! Presented by Yi He
