Approximate Graph Mining with Label Costs

1 Approximate Graph Mining with Label Costs
Pranay Anchuri (1), Mohammed Zaki (1, 2), Omer Barkol (3), Shahar Golan (3), Moshe Shamy (3)
(1) Rensselaer Polytechnic Institute, Troy, NY; (2) Qatar Computing Research Institute, Doha, Qatar; (3) HP Labs, Israel

5 CRYSTAL STRUCTURE OF STREPTOCOCCAL PYROGENIC EXOTOXIN A
The 3D structure of a protein can be represented as an undirected graph. SCOP is a hierarchical classification of proteins based on their structure, with levels Class, Fold, Superfamily, and Family.

6 log score of substituting amino acids

7 log score of substituting amino acids
Question: Are there any common motifs among proteins in the same class or family?

8 PPI network of yeast
Similarity obtained using BLAST alignment score.

9 PPI network of yeast
Question: Which sets of proteins interact together, and what underlying biological process are they involved in?

10 CMDB graphs

11 CMDB graphs
Question: What are the de facto policies observed in the organization?

12 Outline
Approximate Subgraph Isomorphism
Approach: label based representative pruning, support computation
Results
Conclusion

13 Subgraph Isomorphism
[Figure: a database graph and a pattern, each with vertices labeled A, B, C.]

14 Cost of Subgraph Isomorphism
[Figure: a database graph and a pattern, with two costs of 0.2 marked on the match. A cost matrix over labels A, B, C lists the entries 0.2, 0.6, 0.4.]

15 Approximate Isomorphism
[Figure: a database graph and a pattern with vertices labeled A, B, C, with two costs of 0.2 marked on the match. A cost matrix over labels A, B, C lists the entries 0.2, 0.6, 0.4.]
For the rest of the talk, the threshold is the maximum error that we allow for the labels in the graph.
User defined threshold, say 1.0.
Approximate isomorphism iff cost <= threshold.
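The acceptance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the cost is assumed to be the sum of label substitution costs over matched vertices, and the way the values 0.2, 0.6, 0.4 are assigned to label pairs in `COST` is an assumption, since the slide's matrix layout is not recoverable.

```python
# Hypothetical symmetric cost matrix; which label pair gets 0.2, 0.6, 0.4 is assumed.
COST = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.6},
    "B": {"A": 0.2, "B": 0.0, "C": 0.4},
    "C": {"A": 0.6, "B": 0.4, "C": 0.0},
}


def mapping_cost(mapping, pattern_labels, db_labels, cost):
    """Sum the label substitution costs over all matched (pattern, database) pairs."""
    return sum(cost[pattern_labels[p]][db_labels[d]] for p, d in mapping.items())


def is_approx_isomorphism(mapping, pattern_edges, db_edges,
                          pattern_labels, db_labels, cost, threshold):
    """Check injectivity, edge preservation, and total label cost <= threshold."""
    if len(set(mapping.values())) != len(mapping):
        return False  # two pattern vertices share a database vertex
    db_edge_set = {frozenset(e) for e in db_edges}
    for a, b in pattern_edges:
        if frozenset((mapping[a], mapping[b])) not in db_edge_set:
            return False  # a pattern edge has no counterpart in the database graph
    return mapping_cost(mapping, pattern_labels, db_labels, cost) <= threshold


# Toy example (assumed): a path labeled A-B-C matched onto a path labeled B-A-C.
pattern_labels = {"p1": "A", "p2": "B", "p3": "C"}
pattern_edges = [("p1", "p2"), ("p2", "p3")]
db_labels = {"d1": "B", "d2": "A", "d3": "C"}
db_edges = [("d1", "d2"), ("d2", "d3")]
mapping = {"p1": "d1", "p2": "d2", "p3": "d3"}

total = mapping_cost(mapping, pattern_labels, db_labels, COST)
accepted = is_approx_isomorphism(mapping, pattern_edges, db_edges,
                                 pattern_labels, db_labels, COST, threshold=1.0)
```

In this toy case two labels are substituted at 0.2 each, so the mapping is accepted at threshold 1.0 and would be rejected at a tighter threshold such as 0.3.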

16 Problem Statement
Given: database graph G, cost matrix C, cost threshold, and minimum support.
Find: frequent approximate maximal patterns.

17 Related Work
C. Chen et al. [ICDM 2007]: gApprox: Mining frequent approximate patterns from a massive network.
S. Zhang and J. Yang [SSDBM 2008]: Randomized approximate graph mining.
Graph querying:
TALE: Y. Tian et al., ICDE 2008
NeMa: Arijit Khan et al., VLDB 2013

18 Isomorphisms → Representatives
[Table: approximate isomorphisms, one column per pattern vertex Pat_0, Pat_1, Pat_2, with database vertices 1 to 6 as entries.]
Representative vertices (the set of unique vertices in each column):
Pat_0: 1, 2, 4
Pat_1: 3, 5
Pat_2: 6
Representative set: the set of all vertices in the database to which the pattern vertex is mapped in approximate isomorphisms.
Memory requirements: O(|P| * |V|)
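Collapsing a list of isomorphisms into per-vertex representative sets can be sketched as follows. The function name is hypothetical, and the three isomorphism rows are assumed: the slide's table rows are not recoverable, so these rows were chosen only to be consistent with the representative sets listed above.

```python
def representative_sets(isomorphisms):
    """Map each pattern vertex to the set of database vertices it is mapped to."""
    reps = {}
    for iso in isomorphisms:
        for pattern_vertex, db_vertex in iso.items():
            reps.setdefault(pattern_vertex, set()).add(db_vertex)
    return reps


# Assumed rows, one dict per approximate isomorphism.
isomorphisms = [
    {"Pat_0": 1, "Pat_1": 3, "Pat_2": 6},
    {"Pat_0": 2, "Pat_1": 5, "Pat_2": 6},
    {"Pat_0": 4, "Pat_1": 3, "Pat_2": 6},
]
reps = representative_sets(isomorphisms)
```

Storing one set per pattern vertex, rather than every isomorphism, is what keeps memory at O(|P| * |V|): each set can hold at most |V| database vertices.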

19 Approach
Enumerate a candidate pattern.
Compute the representatives of pattern vertices.
Compute the support of the pattern.

21 Approach
Enumerate a candidate pattern.
Compute the representatives of pattern vertices: start from a candidate representative set, use derived labels to prune the candidates, then verify the remaining candidates.
Compute the support of the pattern.

22 k-hop: Derived Label 1
Vertices reachable in exactly k hops, without visiting any vertex more than once.
[Figure: example graph with vertices 10, 20, 30, 40, 50, 60.]
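One way to read this definition is "the end vertices of simple paths with exactly k edges". A minimal sketch under that reading, using a depth-first walk, is given here; the function name and the toy adjacency list are hypothetical and are not the graph from the slide.

```python
def k_hop_label(adj, source, k):
    """Return the vertices that end some simple path of exactly k edges from source."""
    reached = set()

    def walk(vertex, depth, visited):
        if depth == k:
            reached.add(vertex)
            return
        for neighbor in adj[vertex]:
            if neighbor not in visited:  # never visit a vertex twice on one path
                walk(neighbor, depth + 1, visited | {neighbor})

    walk(source, 0, {source})
    return reached


# Toy undirected graph (assumed): triangle 1-2-3 with a pendant vertex 4 on 2.
adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
one_hop = k_hop_label(adj, 1, 1)
two_hop = k_hop_label(adj, 1, 2)
three_hop = k_hop_label(adj, 1, 3)
```

The walk enumerates simple paths, so its cost grows quickly with k; it is only meant for the small k values that the approach relies on.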

23 Cost of matching k-hop labels
Cost of matching k-hop labels h1 and h2 = cost of the minimum cost injective mapping from h1 → h2.
Cost = min cost max flow in a flow network.
[Figure: flow network matching vertices v1 (A), v2 (B), v3 to vertices v'1 (B), v'2 (C), v'3 (A), with edge costs taken from the cost matrix.]
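For very small labels, the minimum cost injective mapping can be found by brute force, which makes the definition easy to check by hand. The sketch below deliberately swaps the min cost max flow formulation for exhaustive search over permutations; the function name, the label lists, and the cost values are all assumptions made for illustration.

```python
import math
from itertools import permutations

# Hypothetical symmetric cost matrix over three labels.
COST = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.6},
    "B": {"A": 0.2, "B": 0.0, "C": 0.4},
    "C": {"A": 0.6, "B": 0.4, "C": 0.0},
}


def khop_match_cost(h1_labels, h2_labels, cost):
    """Minimum total cost of an injective mapping from h1's labels into h2's labels.

    Returns math.inf when h1 has more vertices than h2, since no injective
    mapping exists in that case.
    """
    if len(h1_labels) > len(h2_labels):
        return math.inf
    return min(
        sum(cost[a][b] for a, b in zip(h1_labels, chosen))
        for chosen in permutations(h2_labels, len(h1_labels))
    )


# Labels of the vertices in two k-hop labels (assumed values).
h1 = ["A", "B", "B"]
h2 = ["B", "C", "A"]
best = khop_match_cost(h1, h2, COST)
```

Here the best mapping sends A to A and one B to B for free, and the remaining B to C at cost 0.4. The brute force is factorial in the label size, which is why a flow (or assignment) formulation is preferable beyond toy inputs.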

24 Theorem on k-hop matching cost
Thm: The cost of matching the k-hop label of a pattern vertex to that of its representative is <= threshold.
Intuition: a k-hop label is a subset of the vertex set.
k-hop labels of database nodes are precomputed.
NP-hard, but tractable for small k!

25 Is 20 a representative of 2 with threshold = 0.5?
[Figure: a pattern with vertices 1 to 6 and a database graph with vertices 10, 20, 30, 40, 50, 60, with vertex labels drawn from A, B, C, D. A cost matrix over A, B, C, D lists the entries 0.7, 0.6, 0.1, 0.3, 1, 0.8.]

26 Is 20 a representative of 2 with threshold = 0.5?
[Figure: same pattern, database graph, and cost matrix as slide 25.]
K    hk2     hk20          Cost(hk2, hk20)
2    4, 6    40, 50, 60    0.4

27 Is 20 a representative of 2 with threshold = 0.5?
[Figure: same pattern, database graph, and cost matrix as slide 25.]
We do this for every pair of pattern vertex and its representative. Pruning can be used when checking for (1, 10).
K    hk2     hk20      Cost(hk2, hk20)
4    4, 6    30, 60    0.6
Since 0.6 > 0.5, the k-hop check rules out 20 as a representative of 2.

29 Neighbor Concatenated Label (NL): Derived Label 2
k-hop label with history!
NL of u in iteration k consists of the k-hop label of u and the NLs of the neighbors of u in iteration k-1.

32 Dominating property of NL
NL of u in iteration k: the k-hop label of u and the NLs of the neighbors of u in iteration k-1.
NL of v in iteration k: the k-hop label of v and the NLs of the neighbors of v in iteration k-1.
NL of v dominates NL of u iff:
the cost of matching their k-hop labels is <= threshold, and
there is an injective mapping between the neighbors of u and v.
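The dominance test can be sketched recursively. Several points here are assumptions rather than statements from the slides: the injective mapping is read as pairing each neighbor of u with a distinct neighbor of v whose NL dominates it at iteration k-1, iteration 0 is taken as the base case, and the k-hop cost test is passed in as a hypothetical callable `khop_ok(u, v, k)`. A simple augmenting-path bipartite matching stands in for the max flow computation.

```python
def dominates(v, u, k, db_adj, pat_adj, khop_ok):
    """True if the NL of database vertex v dominates the NL of pattern vertex u."""
    if not khop_ok(u, v, k):
        return False
    if k == 0:
        return True  # assumed base case: only the label check applies
    u_nbrs = list(pat_adj[u])
    # For each pattern neighbor, the database neighbors that dominate it at k-1.
    compatible = {
        a: [b for b in db_adj[v] if dominates(b, a, k - 1, db_adj, pat_adj, khop_ok)]
        for a in u_nbrs
    }
    match = {}  # database neighbor -> pattern neighbor currently assigned to it

    def augment(a, seen):
        """Try to assign pattern neighbor a, reassigning earlier ones if needed."""
        for b in compatible[a]:
            if b in seen:
                continue
            seen.add(b)
            if b not in match or augment(match[b], seen):
                match[b] = a
                return True
        return False

    return all(augment(a, set()) for a in u_nbrs)


# Toy data (assumed): a pattern star u0 with two Y-labeled neighbors.
pat_adj = {"u0": ["u1", "u2"], "u1": ["u0"], "u2": ["u0"]}
pat_label = {"u0": "X", "u1": "Y", "u2": "Y"}
db_adj = {"v0": ["v1", "v2"], "v1": ["v0"], "v2": ["v0"], "w0": ["w1"], "w1": ["w0"]}
db_label = {"v0": "X", "v1": "Y", "v2": "Y", "w0": "X", "w1": "Y"}


def khop_ok(u, v, k):
    """Stand-in for the k-hop cost test: exact label equality, ignoring k."""
    return pat_label[u] == db_label[v]


v0_dominates = dominates("v0", "u0", 1, db_adj, pat_adj, khop_ok)
w0_dominates = dominates("w0", "u0", 1, db_adj, pat_adj, khop_ok)
```

In the toy data, v0 has two suitable neighbors and so dominates u0, while w0 has only one and fails the injective mapping condition.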

33 Theorem on NL
Thm: The NL of a representative vertex dominates the NL of the pattern vertex for all k.
Intuition: if v represents u, the neighbors of v represent the neighbors of u.

34 Approach
Enumerate a candidate pattern.
Compute the representatives of pattern vertices: start from a candidate representative set, use derived labels to prune the candidates, then verify the remaining candidates.
Compute the support of the pattern.

35 Verifying candidates
The k-hop label and NL are only necessary conditions.
Computing the exact representative set requires complete enumeration (combinatorial explosion).
But finding just one approximate subgraph isomorphism is sufficient to confirm a candidate.
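Verification can stop at the first approximate isomorphism that maps the pattern vertex onto the candidate. The backtracking search below is a minimal sketch of that idea, not the authors' procedure: all names, the toy graphs, and the cost values are assumptions, and costs are assumed non-negative so that a partial mapping can be abandoned once it exceeds the threshold.

```python
def find_one_isomorphism(pat_adj, db_adj, pat_label, db_label, cost, threshold, candidates):
    """Return one approximate isomorphism as a dict, or None if none exists."""
    order = list(pat_adj)

    def extend(i, mapping, used, spent):
        if i == len(order):
            return dict(mapping)
        p = order[i]
        for d in candidates[p]:
            if d in used:
                continue  # keep the mapping injective
            total = spent + cost[pat_label[p]][db_label[d]]
            if total > threshold:
                continue  # costs are non-negative, so this branch cannot recover
            # Every edge to an already mapped pattern neighbor must exist in the database.
            if any(q in mapping and mapping[q] not in db_adj[d] for q in pat_adj[p]):
                continue
            mapping[p] = d
            used.add(d)
            found = extend(i + 1, mapping, used, total)
            if found is not None:
                return found
            del mapping[p]
            used.discard(d)
        return None

    return extend(0, {}, set(), 0.0)


def is_representative(p, d, pat_adj, db_adj, pat_label, db_label, cost, threshold, candidates):
    """Confirm candidate d for pattern vertex p by pinning p to d and searching once."""
    pinned = dict(candidates)
    pinned[p] = [d]
    return find_one_isomorphism(pat_adj, db_adj, pat_label, db_label,
                                cost, threshold, pinned) is not None


# Toy data (assumed): pattern edge A-B, database path A-B-C.
COST = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.6},
    "B": {"A": 0.2, "B": 0.0, "C": 0.4},
    "C": {"A": 0.6, "B": 0.4, "C": 0.0},
}
pat_adj = {"p1": ["p2"], "p2": ["p1"]}
pat_label = {"p1": "A", "p2": "B"}
db_adj = {"d1": ["d2"], "d2": ["d1", "d3"], "d3": ["d2"]}
db_label = {"d1": "A", "d2": "B", "d3": "C"}
candidates = {p: list(db_adj) for p in pat_adj}
args = (pat_adj, db_adj, pat_label, db_label, COST, 0.5, candidates)

exact = is_representative("p1", "d1", *args)
swapped = is_representative("p2", "d1", *args)
too_costly = is_representative("p1", "d3", *args)
```

Pinning the pattern vertex to a single candidate keeps the search focused; the pruned candidate lists from the derived labels would be passed in as `candidates` to shrink it further.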

36 Approach
Enumerate a candidate pattern.
Compute the representatives of pattern vertices.
Compute the support of the pattern.

37 Enumerating candidate patterns
Sampling: random one-edge extensions of the frequent pattern. Continue until no extension is frequent.
Alternatives: DFS or BFS in the search space.
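The sampling loop can be sketched as a random walk that keeps extending the pattern while some one-edge extension is still frequent. Everything here is illustrative: the pattern representation (edge list plus label dict), the function names, and the toy `is_frequent` predicate are assumptions, and the enumeration does not remove isomorphic duplicates.

```python
import random


def one_edge_extensions(edges, labels, alphabet):
    """List all patterns obtained by adding exactly one edge to the pattern."""
    extensions = []
    vertices = sorted(labels)
    present = {frozenset(e) for e in edges}
    # Close an edge between two existing vertices.
    for i, a in enumerate(vertices):
        for b in vertices[i + 1:]:
            if frozenset((a, b)) not in present:
                extensions.append((edges + [(a, b)], dict(labels)))
    # Attach a new labeled vertex to an existing vertex.
    new_vertex = max(vertices) + 1
    for a in vertices:
        for label in alphabet:
            extensions.append((edges + [(a, new_vertex)], {**labels, new_vertex: label}))
    return extensions


def sample_maximal_pattern(edges, labels, alphabet, is_frequent, rng):
    """Extend randomly until no one-edge extension is frequent."""
    while True:
        extensions = one_edge_extensions(edges, labels, alphabet)
        rng.shuffle(extensions)
        for candidate in extensions:
            if is_frequent(*candidate):
                edges, labels = candidate
                break
        else:
            return edges, labels  # no frequent extension: the pattern is maximal


def toy_is_frequent(edges, labels):
    """Stand-in for the real support test: patterns with at most 3 edges are frequent."""
    return len(edges) <= 3


final_edges, final_labels = sample_maximal_pattern(
    [], {0: "A"}, ["A", "B"], toy_is_frequent, random.Random(0)
)
```

In practice `is_frequent` would compute the representative sets and compare the support with the minimum support; the stand-in only makes the loop terminate in a predictable way.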

38 Approach
Enumerate a candidate pattern.
Compute the representatives of pattern vertices.
Compute the support of the pattern.

39 Support of the pattern
Has to be anti-monotonic.
Support: size of the smallest representative set.
Lower bounded by the vertex-disjoint support [1].
[1] M. Kuramochi and G. Karypis [DMKD '05]: Finding frequent patterns in a large sparse graph.
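The support definition translates directly into code. The sketch below uses hypothetical function names and, as input, the representative sets shown earlier in the talk for Pat_0, Pat_1, and Pat_2.

```python
def support(representatives):
    """Support of a pattern: the size of its smallest representative set."""
    return min((len(vertices) for vertices in representatives.values()), default=0)


def is_frequent(representatives, min_support):
    """A pattern is frequent when its support reaches the minimum support."""
    return support(representatives) >= min_support


reps = {"Pat_0": {1, 2, 4}, "Pat_1": {3, 5}, "Pat_2": {6}}
pattern_support = support(reps)
```

Pat_2 has a single representative, so the support of this pattern is 1 regardless of how many representatives the other vertices have.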

40 Time & space complexity
Space: |V_G| for each pattern vertex to store representatives, plus the precomputed k-hop labels.
Time: computing max flow for the NL dominance check is the main bottleneck. Flow networks are usually small (on the order of the average degree of the nodes).

41 Results on real world datasets

42 Datasets
Dataset   |V|     |E|      # Labels   Preprocessing
SCOP      39256   154328   20
CMDB      10466   15122    84         329.31s
PPI       4950    16515    ---

44 gApprox: Mining Frequent Approximate Patterns from a Massive Network [Chen et al, ICDM 07]

47 Motifs in proteins

48 Patterns in PPI network

49 Conclusion
Proposed a method to mine approximate frequent patterns from a large single graph or database graphs.
Label based pruning of the representative patterns.
The proposed method finds meaningful patterns in several real world graphs.

50 Thank you :) Questions ?

