Approximate Graph Mining with Label Costs

Pranay Anchuri (1), Mohammed Zaki (1,2), Omer Barkol (3), Shahar Golan (3), Moshe Shamy (3)
(1) Rensselaer Polytechnic Institute, Troy, NY; (2) Qatar Computing Research Institute, Doha, Qatar; (3) HP Labs, Israel


[Figure: crystal structure of streptococcal pyrogenic exotoxin A.]
The 3D structure of a protein can be represented as an undirected graph. SCOP is a hierarchical classification of proteins based on their structure: Class, Fold, Superfamily, Family.

Log score of substituting amino acids
Question: Are there any common motifs among proteins in the same class or family?

PPI network of yeast
Similarity obtained using BLAST alignment score.
Question: Which sets of proteins interact together, and what underlying biological process are they involved in?

CMDB graphs
Question: What are the de facto policies observed in the organization?

Outline
Approximate subgraph isomorphism
Approach
  Label-based representative pruning
  Support computation
Results
Conclusion

Subgraph Isomorphism
[Figure: a database graph and a pattern, each with vertices labeled A, B, C.]

Cost of Subgraph Isomorphism
[Figure: database graph and pattern, with the values 0.2 and 0.2 marked on the mapping; label cost matrix over A, B, C with entries 0.2, 0.6, 0.4.]

Approximate Isomorphism
[Figure: database graph and pattern over labels A, B, C, with the values 0.2 and 0.2 marked on the mapping; label cost matrix over A, B, C with entries 0.2, 0.6, 0.4.]
User-defined threshold, say 1.0.
A mapping is an approximate isomorphism iff cost <= threshold.
For the rest of the talk, the threshold is the maximum error that we allow for the labels in the graph.
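The threshold test can be sketched in a few lines. This is a minimal illustration that assumes the cost of a mapping is the sum of the label substitution costs over the mapped vertex pairs. The pairing of the values 0.2, 0.6, 0.4 with label pairs, the example labels, and the names `label_cost` and `mapping_cost` are all hypothetical, since the slide does not show which label pair each cost belongs to.

```python
# Hypothetical symmetric label costs; which value belongs to which pair is assumed.
COST = {frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.4}

def label_cost(a, b):
    """Cost of substituting label a by label b (0 for identical labels)."""
    return 0.0 if a == b else COST[frozenset((a, b))]

def mapping_cost(pattern_labels, graph_labels, mapping):
    """Sum of label substitution costs over a pattern -> database vertex mapping."""
    return sum(
        label_cost(pattern_labels[p], graph_labels[g]) for p, g in mapping.items()
    )

# Hypothetical labels; the mapping is assumed to preserve the pattern's edges.
pattern_labels = {1: "A", 2: "B", 3: "C"}
graph_labels = {10: "B", 20: "A", 30: "C"}
mapping = {1: 10, 2: 20, 3: 30}

cost = mapping_cost(pattern_labels, graph_labels, mapping)  # two mismatches of 0.2 each
is_approximate_isomorphism = cost <= 1.0  # user-defined threshold from the slide
```

Only the label check is shown here; the structural check (every pattern edge maps to a database edge) is assumed to have been done separately.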

Problem Statement
Given: a database graph G, a cost matrix C, a cost threshold, and a minimum support.
Find: frequent approximate maximal patterns.

Related Work
C. Chen et al. [ICDM 07]: gApprox: Mining frequent approximate patterns from a massive network.
S. Zhang and J. Yang [SSDBM 2008]: Randomized approximate graph mining.
Graph querying:
TALE: Y. Tian et al., ICDE 2008
NeMa: Arijit Khan et al., VLDB 2013

Isomorphisms → Representatives
[Table: approximate isomorphisms, one column per pattern vertex Pat_0, Pat_1, Pat_2; entries 1 2 4 3 5 6.]
Representative vertices (set of unique vertices in each column):
Pat_0: 1, 2, 4
Pat_1: 3, 5
Pat_2: 6
Representative set: the set of all database vertices to which a pattern vertex is mapped in approximate isomorphisms.
Memory requirements: O(|P| * |V|)
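Collapsing a list of isomorphisms into representative sets might look like the sketch below. The three isomorphism rows are hypothetical: they were chosen only so that the result reproduces the representative sets on the slide, whose original row layout is not recoverable. The function name `representative_sets` is likewise an assumption.

```python
def representative_sets(isomorphisms):
    """Map each pattern vertex to the set of database vertices it is mapped to."""
    reps = {}
    for iso in isomorphisms:
        for pattern_vertex, db_vertex in iso.items():
            reps.setdefault(pattern_vertex, set()).add(db_vertex)
    return reps

# Hypothetical rows, picked to match the slide's representative sets.
isomorphisms = [
    {"Pat_0": 1, "Pat_1": 3, "Pat_2": 6},
    {"Pat_0": 2, "Pat_1": 3, "Pat_2": 6},
    {"Pat_0": 4, "Pat_1": 5, "Pat_2": 6},
]
reps = representative_sets(isomorphisms)
```

Storing one set per pattern vertex, rather than every isomorphism, is what gives the O(|P| * |V|) memory bound.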

Approach
Enumerate a candidate pattern.
Compute the representatives of the pattern vertices.
Compute the support of the pattern.

Approach
Enumerate a candidate pattern.
Compute the representatives of the pattern vertices:
  Candidate representative set.
  Use derived labels to prune the candidates.
  Verify the remaining candidates.
Compute the support of the pattern.

k-hop: Derived Label 1
Vertices reachable in exactly k hops, without visiting any vertex more than once.
[Figure: example graph with vertices 10, 20, 30, 40, 50, 60.]
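One way to read this definition is "endpoints of simple paths with exactly k edges". The sketch below enumerates such paths by depth-first search. The function name `k_hop` and the small example graph are hypothetical; the graph on the slide cannot be recovered from the transcript.

```python
def k_hop(adj, source, k):
    """Vertices reachable from source by a simple path of exactly k edges."""
    reached = set()

    def walk(vertex, depth, visited):
        if depth == k:
            reached.add(vertex)
            return
        for neighbor in adj[vertex]:
            if neighbor not in visited:  # never visit a vertex twice on one path
                walk(neighbor, depth + 1, visited | {neighbor})

    walk(source, 0, {source})
    return reached

# Hypothetical undirected graph as an adjacency list.
adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
one_hop = k_hop(adj, 1, 1)
two_hop = k_hop(adj, 1, 2)
three_hop = k_hop(adj, 1, 3)
```

Enumerating simple paths is exponential in k in the worst case, so this only makes sense for small k.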

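Cost of matching k-hop labels
Cost of matching k-hop labels h1 and h2 = cost of the minimum-cost injective mapping from h1 → h2.
Cost = min-cost max-flow in a flow network.
[Figure: flow network between the vertices of h1 and h2, with edge costs taken from the cost matrix.]
h1: vertex v1 has label A, v2 has label B, v3
h2: vertex v'1 has label B, v'2 has label C, v'3 has label A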
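For very small labels, the minimum-cost injective mapping can be found by brute force. The sketch below uses enumeration over injective assignments, not the min-cost max-flow formulation named on the slide; it gives the same cost. The cost values, the labels of h1, and the names `label_cost` and `match_cost` are hypothetical; only the labels B, C, A of h2 come from the slide.

```python
from itertools import permutations

# Hypothetical symmetric label costs (not taken from the slides).
COST = {frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.4}

def label_cost(a, b):
    """Cost of substituting label a by label b (0 for identical labels)."""
    return 0.0 if a == b else COST[frozenset((a, b))]

def match_cost(h1, h2):
    """Cost of the cheapest injective mapping of the labels in h1 into h2."""
    if len(h1) > len(h2):
        return float("inf")  # no injective mapping exists
    return min(
        sum(label_cost(a, b) for a, b in zip(h1, image))
        for image in permutations(h2, len(h1))
    )

h1 = ["A", "A"]       # hypothetical k-hop label of a pattern vertex
h2 = ["B", "C", "A"]  # labels of v'1, v'2, v'3 from the slide
cheapest = match_cost(h1, h2)  # one A maps to A, the other to B
```

Brute force is factorial in the label size, which is why a flow or assignment solver is the better choice beyond toy inputs.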

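Theorem on k-hop matching cost
Thm: If v is a representative of pattern vertex u, then the cost of matching the k-hop label of u to the k-hop label of v is <= threshold.
Intuition: the k-hop label is a subset of the pattern's vertex set, so restricting an approximate isomorphism to it cannot cost more than the whole mapping.
k-hop labels of database nodes are precomputed.
NP-hard, but tractable for small k!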

Is 20 a representative of 2 with threshold = 0.5?
[Figure: database graph with vertices 10, 20, 30, 40, 50, 60 and pattern with vertices 1 to 6, labels drawn from A, B, C, D; cost matrix over A, B, C, D with entries 0.7, 0.6, 0.1, 0.3, 1, 0.8.]
k = 2: k-hop label of 2 = {4, 6}; k-hop label of 20 = {40, 50, 60}; cost = 0.4
k = 4: k-hop label of 2 = {4, 6}; k-hop label of 20 = {30, 60}; cost = 0.6
At k = 4 the cost 0.6 exceeds the threshold 0.5, so 20 is not a representative of 2.
We do this for every pair of pattern vertex and its representative. Pruning can be used when checking for (1, 10).
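The pruning decision for this example can be written as a small check. The sketch assumes that a candidate must satisfy the k-hop cost bound for every k that is tested. The function name `survives_k_hop_pruning` is hypothetical; the two costs are the ones given on the slide.

```python
def survives_k_hop_pruning(costs_by_k, threshold):
    """A candidate survives only if every k-hop matching cost is within the threshold."""
    return all(cost <= threshold for cost in costs_by_k.values())

# Matching costs from the slide for pattern vertex 2 and database vertex 20.
costs_2_vs_20 = {2: 0.4, 4: 0.6}
keep_20 = survives_k_hop_pruning(costs_2_vs_20, threshold=0.5)
```

Here `keep_20` is false because the k = 4 cost of 0.6 is above 0.5, so vertex 20 is dropped from the candidate set of pattern vertex 2 without any isomorphism search.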

Neighbor Concatenated Label (NL): Derived Label 2
k-hop label with history!
NL of u in iteration k = (k-hop label of u, NLs of the neighbors of u in iteration k-1)
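The recursive structure of the NL can be sketched as nested tuples. The slide does not give a base case, so this sketch assumes that in iteration 0 the NL is just the vertex's own label with no neighbor part. The names `k_hop_label` and `neighbor_label` and the example graph are hypothetical.

```python
def k_hop_label(adj, labels, u, k):
    """Sorted labels of vertices reachable from u by a simple path of exactly k edges."""
    reached = set()

    def walk(vertex, depth, visited):
        if depth == k:
            reached.add(vertex)
            return
        for neighbor in adj[vertex]:
            if neighbor not in visited:
                walk(neighbor, depth + 1, visited | {neighbor})

    walk(u, 0, {u})
    return tuple(sorted(labels[v] for v in reached))

def neighbor_label(adj, labels, u, k):
    """NL of u in iteration k: (k-hop label of u, NLs of u's neighbors in iteration k-1)."""
    hop = k_hop_label(adj, labels, u, k)
    if k == 0:
        return (hop, ())  # assumed base case: own label, no neighbor history
    return (hop, tuple(neighbor_label(adj, labels, w, k - 1) for w in adj[u]))

# Hypothetical path graph 1 - 2 - 3 with labels A, B, C.
adj = {1: [2], 2: [1, 3], 3: [2]}
labels = {1: "A", 2: "B", 3: "C"}
nl_1 = neighbor_label(adj, labels, 1, 1)
```

The nested form grows with the product of the degrees along each branch, so in practice it would be built only for small k.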

Dominating property of NCL
NL of u in iteration k = (k-hop label of u, NLs of the neighbors of u in iteration k-1)
NL of v in iteration k = (k-hop label of v, NLs of the neighbors of v in iteration k-1)
NL of v dominates NL of u iff:
  the cost of matching the k-hop labels of u and v is <= threshold, and
  there is an injective mapping between the neighbors of u and v.
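A possible reading of the dominance test is sketched below. It assumes that the injective mapping must send each neighbor of u to a neighbor of v whose NL dominates it in turn, and that the threshold is applied separately at each level; the slide does not spell out either point. The injective mapping is found by brute force, not by max flow. The NL encoding as `(k_hop_label, neighbor_NLs)` tuples, the cost values, and all function names are hypothetical.

```python
from itertools import permutations

# Hypothetical symmetric label costs (not taken from the slides).
COST = {frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.4}

def label_cost(a, b):
    return 0.0 if a == b else COST[frozenset((a, b))]

def match_cost(h1, h2):
    """Cost of the cheapest injective mapping of the labels in h1 into h2."""
    if len(h1) > len(h2):
        return float("inf")
    return min(
        sum(label_cost(a, b) for a, b in zip(h1, image))
        for image in permutations(h2, len(h1))
    )

def dominates(nl_v, nl_u, threshold):
    """True if NL of database vertex v dominates NL of pattern vertex u."""
    hop_u, nbrs_u = nl_u
    hop_v, nbrs_v = nl_v
    if match_cost(hop_u, hop_v) > threshold:
        return False
    # Try every injective assignment of u's neighbors to v's neighbors.
    return any(
        all(dominates(nv, nu, threshold) for nu, nv in zip(nbrs_u, image))
        for image in permutations(nbrs_v, len(nbrs_u))
    )

def leaf(label):
    """NL with no neighbor history (assumed iteration-0 form)."""
    return ((label,), ())

nl_u = (("B",), (leaf("B"),))                 # pattern vertex with one B neighbor
nl_v = (("B", "C"), (leaf("C"), leaf("B")))   # database vertex with C and B neighbors
nl_w = (("C",), (leaf("C"),))                 # database vertex with one C neighbor

v_dominates_u = dominates(nl_v, nl_u, threshold=0.1)
w_dominates_u = dominates(nl_w, nl_u, threshold=0.1)
```

With these inputs `nl_v` dominates `nl_u`, while `nl_w` does not at threshold 0.1 because substituting B by C costs 0.4.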

Theorem on NL
Thm: The NL of a representative vertex dominates the NL of the pattern vertex for all k.
Intuition: if v represents u, the neighbors of v represent the neighbors of u.


Verifying candidates
k-hop label and NCL are only necessary conditions.
Computing the exact representative set by enumerating all isomorphisms leads to a combinatorial explosion.
But finding just one approximate subgraph isomorphism that maps the pattern vertex to the candidate is sufficient.
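Verification can stop at the first isomorphism it finds. The backtracking sketch below illustrates that idea under several assumptions: the pattern is connected, the mapping only needs to preserve pattern edges (non-induced), and the cost of a mapping is the sum of label substitution costs. The function names, graphs, and cost values are hypothetical, and this is not claimed to be the search procedure used in the talk.

```python
from collections import deque

# Hypothetical symmetric label costs (not taken from the slides).
COST = {frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.4}

def label_cost(a, b):
    return 0.0 if a == b else COST[frozenset((a, b))]

def find_one_isomorphism(p_adj, p_lab, g_adj, g_lab, u, v, threshold):
    """Return one mapping with u -> v and total cost <= threshold, or None."""
    # Visit pattern vertices in BFS order so each one has an already-mapped neighbor.
    order, seen, queue = [], {u}, deque([u])
    while queue:
        p = queue.popleft()
        order.append(p)
        for q in p_adj[p]:
            if q not in seen:
                seen.add(q)
                queue.append(q)

    def extend(mapping, spent, i):
        if i == len(order):
            return dict(mapping)
        p = order[i]
        anchors = [mapping[q] for q in p_adj[p] if q in mapping]
        for c in g_adj[anchors[0]]:
            if c in mapping.values():
                continue  # keep the mapping injective
            if any(c not in g_adj[a] for a in anchors):
                continue  # every mapped pattern edge must exist in the database
            step = label_cost(p_lab[p], g_lab[c])
            if spent + step > threshold:
                continue  # prune: the cost budget is already exceeded
            mapping[p] = c
            found = extend(mapping, spent + step, i + 1)
            if found is not None:
                return found  # stop at the first isomorphism
            del mapping[p]
        return None

    start = label_cost(p_lab[u], g_lab[v])
    return None if start > threshold else extend({u: v}, start, 1)

# Hypothetical pattern path 1(A) - 2(B) - 3(C) and database path 10(A) - 20(C) - 30(C).
p_adj, p_lab = {1: [2], 2: [1, 3], 3: [2]}, {1: "A", 2: "B", 3: "C"}
g_adj, g_lab = {10: [20], 20: [10, 30], 30: [20]}, {10: "A", 20: "C", 30: "C"}

found = find_one_isomorphism(p_adj, p_lab, g_adj, g_lab, 1, 10, threshold=0.5)
missing = find_one_isomorphism(p_adj, p_lab, g_adj, g_lab, 1, 10, threshold=0.3)
```

A successful search confirms that 10 is a representative of pattern vertex 1; a failed search removes it from the candidate set.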

Enumerating candidate patterns
Sampling: random one-edge extensions of a frequent pattern; continue until no extension is frequent.
Alternatives: DFS or BFS over the search space.
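The sampling loop can be sketched as a random walk that stops at a pattern with no frequent extension. The callbacks `extensions` and `is_frequent` are hypothetical placeholders for one-edge extension and the support test; the toy example uses strings in place of graph patterns purely to make the loop runnable.

```python
import random

def sample_maximal_pattern(seed, extensions, is_frequent, rng=None):
    """Randomly extend a frequent pattern until no extension is frequent."""
    rng = rng or random.Random()
    pattern = seed
    while True:
        frequent = [p for p in extensions(pattern) if is_frequent(p)]
        if not frequent:
            return pattern  # no frequent extension: pattern is maximal
        pattern = rng.choice(frequent)

# Toy stand-in: a "pattern" is a string, an extension appends one character,
# and a pattern counts as frequent while it has at most 3 characters.
def toy_extensions(pattern):
    return [pattern + "a", pattern + "b"]

def toy_is_frequent(pattern):
    return len(pattern) <= 3

sampled = sample_maximal_pattern("", toy_extensions, toy_is_frequent, random.Random(0))
```

This variant tests every extension before choosing one. Picking a random extension first and testing only that one is an equally plausible reading of the slide.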

Support of the pattern
Has to be anti-monotonic.
Support: size of the smallest representative set.
Lower bounded by the vertex-disjoint support [1].
[1] M. Kuramochi and G. Karypis [DMKD '05]: Finding frequent patterns in a large sparse graph.
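The support definition is a one-liner once the representative sets are known. The sketch below applies it to the representative sets from the earlier Pat_0, Pat_1, Pat_2 slide; the function name `support` is an assumption.

```python
def support(representatives):
    """Support of a pattern: size of its smallest representative set."""
    return min(len(vertices) for vertices in representatives.values())

# Representative sets from the "Isomorphisms → Representatives" slide.
representatives = {"Pat_0": {1, 2, 4}, "Pat_1": {3, 5}, "Pat_2": {6}}
pattern_support = support(representatives)
```

Here the support is 1, because Pat_2 is only ever mapped to vertex 6.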

Time & space complexity
Space: up to |V_G| per pattern vertex to store representatives, plus the precomputed k-hop labels.
Time: computing max flow for the NL dominance check is the main bottleneck. Flow networks are usually small (on the order of the average degree of the nodes).

Results on real world datasets

Datasets
Dataset   |V|     |E|      # Labels   Preprocessing
SCOP      39256   154328   20
CMDB      10466   15122    84         329.31s
PPI       4950    16515    ---

gApprox: Mining Frequent Approximate Patterns from a Massive Network [Chen et al, ICDM 07]

Motifs in proteins

Patterns in PPI network

Conclusion
Proposed a method to mine approximate frequent patterns from large single graphs/database graphs.
Label-based pruning of the representative patterns.
The proposed method finds meaningful patterns in several real-world graphs.

Thank you :) Questions ?
