gApprox: Mining Frequent Approximate Patterns from a Massive Network

1 gApprox: Mining Frequent Approximate Patterns from a Massive Network
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han [ICDM 2007] Reporter: Che-Wei Liang, 10/16

2 Outline
Introduction
Problem Formulation
Algorithm: Pattern Space Exploration, Support Counting
Experiment
Conclusions

3 Introduction A set of graphs vs. a single network
Recently, a large number of graphs with massive sizes and complex structures have emerged in many applications, such as biological networks, social networks, and the Web, demanding powerful data mining methods. The interest is now in patterns that frequently appear at many different places of a single network.

4 Introduction Protein-Protein Interaction (PPI) network
Δ = degree of approximation = 5

5 Two major complications
1. Mining frequent patterns in a single network: partition it into regions, each of which contains one occurrence of the pattern.
2. Due to inherent noise and data diversity, it is crucial to account for approximations so that all potentially interesting patterns can be captured.

7 Problem Formulation

8 Approximate Pattern Occurrences
An injective function m: Vp → VG maps each vertex v ∈ Vp to m(v) ∈ VG. The degree of approximation that m incurs is quantified; approximations can only happen within the matchable list.

9 Approximate Pattern Occurrences

10 Approximate Pattern Occurrences

11 Approximate Pattern Occurrences

12 Pattern Support with Approximation

13 Pattern Support with Approximation

14 Pattern Support with Approximation

16 Algorithm
Two major issues:
1. Pattern Space Exploration
2. Support Counting: enumerate the approximate occurrences of each pattern in the network, then decide the maximum number of disjoint occurrences.

17 Pattern Space Exploration
Decompose the pattern space:
Find all connected vertex sets in G that contain vertex 1.
Remove vertex 1 from G, and find all connected vertex sets in the new graph G' that contain vertex 2.
And so on for the remaining vertices.

18 Pattern Space Exploration
Example: generating all connected vertex sets starting from vertex 1.
Stage1. Start from 1 and mark 1.
Stage2. Expand from 1 to reach 2, 5, and 6. Mark 2, 5, and 6. There are seven connected vertex sets in this stage: {1,2}, {1,5}, {1,6}, {1,2,5}, {1,2,6}, {1,5,6}, {1,2,5,6}.
Stage3. Taking each of the seven connected vertex sets from stage 2 as a starting point, continue the expansion.
Stage4. Repeat until there are no more unmarked vertices.
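The decomposition and the staged expansion described on slides 17 and 18 can be sketched in Python. This is a minimal sketch under stated assumptions, not the paper's Explore() procedure: the function names, the adjacency-dict representation, and the example graph are all hypothetical, since the slide's own graph is not reproduced in the text. Each stage reaches the unmarked neighbors of the vertices added in the previous stage, marks all of them, and branches on every non-empty subset.

```python
from itertools import combinations


def nonempty_subsets(vertices):
    """Yield every non-empty subset of `vertices` as a frozenset."""
    ordered = sorted(vertices)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def connected_sets_containing(adj, root, allowed):
    """Yield each connected vertex set that contains `root` and stays inside `allowed`."""

    def expand(current, frontier, marked):
        yield current
        # Unmarked neighbors reachable from the vertices added in the last stage.
        reached = set()
        for v in frontier:
            reached |= adj[v] & allowed
        reached -= marked
        if not reached:
            return
        # Mark everything reached, whether or not it is chosen below.
        marked = marked | reached
        for chosen in nonempty_subsets(reached):
            yield from expand(current | chosen, chosen, marked)

    start = frozenset([root])
    yield from expand(start, start, start)


def all_connected_sets(adj):
    """Decompose by smallest vertex: sets containing 1, then sets in G minus 1 containing 2, ..."""
    allowed = set(adj)
    for v in sorted(adj):
        yield from connected_sets_containing(adj, v, allowed)
        allowed = allowed - {v}  # remove v before handling the next vertex


if __name__ == "__main__":
    # Hypothetical graph (an assumption): vertex 1 is adjacent to 2, 5, and 6; 2 is adjacent to 3.
    example = {1: {2, 5, 6}, 2: {1, 3}, 3: {2}, 5: {1}, 6: {1}}
    for vertex_set in all_connected_sets(example):
        print(sorted(vertex_set))
```

In the assumed graph, vertex 1 is adjacent to 2, 5, and 6, so the second stage produces the seven sets listed in the example. Marking the reached vertices even when they are not chosen is what keeps a set from being produced twice along different branches.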

22 Theorem 1
Explore() in Algorithm 1 is both complete and redundancy-free, i.e., given a network G:
(1) it only generates connected vertex sets in G;
(2) it can generate all connected vertex sets in G;
(3) it does not generate the same connected vertex set more than once.

23 Support Counting
A pattern P's support is defined as the maximum number of "disjoint" occurrences that can be chosen from P's approximate occurrences in the network. Computing it is NP-complete, since it corresponds to the maximum independent set problem. Algorithm 2 can provide an upper bound.
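To make the support definition concrete, here is a small Python sketch. It is not Algorithm 2, which is not reproduced in these slides and which gives an upper bound. The sketch assumes that "disjoint" means sharing no vertex, and both function names are hypothetical. The greedy scan yields a lower bound on the support, while the exhaustive search returns the exact value but takes exponential time, so it is only usable on tiny inputs.

```python
def greedy_disjoint_count(occurrences):
    """Lower bound: scan in order, keeping each occurrence that shares no vertex with those kept."""
    used = set()
    count = 0
    for occ in occurrences:
        occ = set(occ)
        if used.isdisjoint(occ):
            used |= occ
            count += 1
    return count


def exact_disjoint_count(occurrences):
    """Exact maximum number of pairwise vertex-disjoint occurrences (exponential time)."""
    occs = [frozenset(o) for o in occurrences]

    def best(i, used):
        if i == len(occs):
            return 0
        skip = best(i + 1, used)  # option 1: leave occurrence i out
        if occs[i] & used:
            return skip  # it overlaps a chosen occurrence, so it cannot be taken
        return max(skip, 1 + best(i + 1, used | occs[i]))  # option 2: take it

    return best(0, frozenset())
```

For three occurrences given as vertex sets {2,3}, {1,2}, {3,4} in that order, the greedy scan keeps only {2,3}, while the exact search finds the two disjoint occurrences {1,2} and {3,4}. The gap between a cheap bound and the true value is why a bound-based method is attractive here.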

24 Support Counting

25 gApprox
gApprox combines pattern space exploration with support counting. Relevant point: the conditional branch on the 3rd line of Algorithm 1's DFS_horizontal() function.

26 Experiment

27 Conclusions
The paper gives an approximation measure and shows its impact on mining.
It counts a pattern's support based on the pattern's approximate occurrences in the network.
The techniques are general and can be applied to networks from other domains.
The method can be modified to reach bigger, more interesting patterns even faster, with some sacrifice in the completeness of the mining results.

