KDD 2009 Scalable Graph Clustering using Stochastic Flows Applications to Community Discovery Venu Satuluri and Srinivasan Parthasarathy Data Mining Research.

1 KDD 2009
Scalable Graph Clustering using Stochastic Flows: Applications to Community Discovery
Venu Satuluri and Srinivasan Parthasarathy
Data Mining Research Laboratory, Dept. of Computer Science and Engineering, The Ohio State University
http://www.cse.ohio-state.edu/dmrl

2 Outline
Introduction
- Problem Statement
- Markov Clustering (MCL)
Proposed Algorithms
- Regularized MCL (R-MCL)
- Multi-level Regularized MCL (MLR-MCL)
Evaluation
Conclusions

3 Problem Statement
Graph clustering: partition the vertices of a graph into disjoint sets such that each set is a well-connected, coherent group.
Applications:
- Discovery of protein complexes [Snel ’02]
- Community discovery in social networks [Newman ’06]
- Image segmentation [Shi ’00]
Existing solutions:
- Spectral methods [Shi ’00]
- Edge-based agglomerative/divisive methods [Newman ’04]
- Kernel K-Means [Dhillon ’07]
- Metis [Karypis ’98]
- Markov Clustering (MCL) [van Dongen ’00]

4 Markov Clustering (MCL) [van Dongen ’00]
The original algorithm for clustering graphs using stochastic flows.
Advantages:
- Simple and elegant.
- Widely used in bioinformatics because of its noise tolerance and effectiveness.
Disadvantages:
- Very slow: takes 1.2 hours to cluster a 76K-node social network.
- Prone to output too many clusters: produces 1416 clusters on a 4741-node PPI network.
Can we redress the disadvantages of MCL while retaining its advantages?

5 Terminology
Flow: the transition probability from one node to another.
Flow matrix: the matrix of flows among all nodes; the i-th column holds the flows out of the i-th node, so each column sums to 1.
[Figure: a three-node example graph (nodes 1, 2, 3) and its 3×3 flow matrix.]

6 The MCL algorithm
Input: A, the adjacency matrix.
Initialize M to M_G, the canonical transition matrix: $M := M_G := (A + I) D^{-1}$, where D is the diagonal degree matrix of A + I.
Repeat until converged:
- Expand: M := M*M. Enhances flow to well-connected nodes as well as to new nodes.
- Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column: “Rich get richer, poor get poorer.”
- Prune: saves memory by removing entries close to zero.
Once converged, output clusters.
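The loop on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the pruning threshold, the convergence tolerance, the attractor-based cluster readout, and the `barbell` example graph (two triangles joined by one edge) are all hypothetical choices made for the sketch.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def canonical_flow(adj):
    """M_G = (A + I) D^-1: add self-loops, then normalize columns."""
    n = len(adj)
    return normalize_columns(
        [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )


def expand(m):
    """Expand: M := M * M."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r=2.0):
    """Inflate: raise each entry to the power r, then renormalize columns."""
    return normalize_columns([[v ** r for v in row] for row in m])


def prune(m, threshold=1e-6):
    """Prune: zero out near-zero entries (threshold is an assumed value)."""
    return normalize_columns([[v if v >= threshold else 0.0 for v in row] for row in m])


def mcl(adj, r=2.0, max_iter=100, tol=1e-9):
    """Iterate Expand, Inflate, Prune until the flow matrix stops changing."""
    m = canonical_flow(adj)
    for _ in range(max_iter):
        new = prune(inflate(expand(m), r))
        change = max(abs(x - y) for rn, ro in zip(new, m) for x, y in zip(rn, ro))
        m = new
        if change < tol:
            break
    return m


def clusters_from_flow(m):
    """Assumed readout: group nodes by the row receiving most of their flow."""
    n = len(m)
    groups = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        groups.setdefault(attractor, []).append(j)
    return sorted(groups.values())


# Hypothetical example: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
barbell = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(clusters_from_flow(mcl(barbell)))
```

Dense lists keep the sketch dependency-free; a practical implementation would use sparse matrices, since pruning exists precisely to keep M sparse.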

7 The Regularize operator
Why does MCL output many clusters? The original matrix is used only at the start, so information about each node's neighborhood fades as the iterations proceed. MCL therefore "overfits": nothing penalizes divergence between the flows of neighboring nodes.
Remedy: Let $q_i$, $i = 1, \dots, k$, be the flow distributions of the k neighbors of node q in the graph, and let $w_i$, $i = 1, \dots, k$, be the respective normalized edge weights. Update the flow of q to:
$q^* = \arg\min_{q} \sum_{i=1}^{k} w_i \, \mathrm{KL}(q_i \,\|\, q)$
Closed-form solution:
$q^*(x) = \sum_{i=1}^{k} w_i \, q_i(x)$
This update defines the Regularize operator. In matrix notation, $\mathrm{Regularize}(M) := M M_G = M (A + I) D^{-1}$.
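A short check, offered as a sketch rather than the authors' derivation, of why the matrix form on this slide is a weighted average over neighbors. It assumes that column i of M is the flow distribution $q_i$, and that entry (i, q) of $M_G$ is the normalized edge weight $w_i$, which is zero when i is not a neighbor of q. The symbols n (number of nodes) and N(q) (neighbor set of q, which includes q itself because of the self-loop in A + I) are notation introduced here.

```latex
\[
\bigl(M M_G\bigr)_{x,q}
  = \sum_{i=1}^{n} M_{x,i}\,(M_G)_{i,q}
  = \sum_{i \in N(q)} w_i \, q_i(x),
\qquad
\sum_{i \in N(q)} w_i = 1 .
\]
```

Because the weights are nonnegative and sum to 1, each regularized column is again a probability distribution.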

8 The Regularized-MCL algorithm
Input: A, the adjacency matrix.
Initialize M to M_G, the canonical transition matrix: $M := M_G := (A + I) D^{-1}$.
Repeat until converged:
- Regularize: M := M*M_G. Takes into account flows of the neighbors.
- Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column: “Rich get richer, poor get poorer.”
- Prune: saves memory by removing entries close to zero.
Once converged, output clusters.
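The only change from MCL is that Expand is replaced by Regularize, so a sketch differs in a single line. The code below is an illustration under assumptions, not the authors' implementation: the names, the threshold, the tolerance, the `barbell` example graph, and the readout that assigns each node to the row holding the largest share of its flow are all hypothetical. In this sketch the columns need not become exact indicator vectors, which is why the readout takes the maximum rather than testing for entries equal to 1.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Dense matrix product of two square matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def r_mcl(adj, r=2.0, max_iter=100, tol=1e-9, threshold=1e-6):
    """Iterate Regularize, Inflate, Prune; M_G stays fixed throughout."""
    n = len(adj)
    # M_G = (A + I) D^-1
    m_g = normalize_columns(
        [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    m = m_g
    for _ in range(max_iter):
        new = matmul(m, m_g)                                        # Regularize
        new = normalize_columns([[v ** r for v in row] for row in new])  # Inflate
        new = normalize_columns(
            [[v if v >= threshold else 0.0 for v in row] for row in new]
        )                                                           # Prune
        change = max(abs(a - b) for rn, ro in zip(new, m) for a, b in zip(rn, ro))
        m = new
        if change < tol:
            break
    return m


def clusters_from_flow(m):
    """Assumed readout: group nodes by the row receiving most of their flow."""
    n = len(m)
    groups = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        groups.setdefault(attractor, []).append(j)
    return sorted(groups.values())


# Hypothetical example: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
barbell = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(clusters_from_flow(r_mcl(barbell)))
```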

9 Multi-level Regularized MCL
- Coarsen: Input Graph -> Intermediate Graph -> Intermediate Graph -> ... -> Coarsest Graph.
- On each coarsened graph, starting from the coarsest: run curtailed R-MCL, then project flow to the next finer graph.
- On the input graph: run R-MCL to convergence and output clusters.
Why it helps:
- Faster to run on smaller graphs first.
- Captures the global topology of the graph.
- Initializes the flow matrix of the refined graph.

10 Coarsening operation
Construct a matching: a set of edges such that no vertex is shared between any two of them. Each matched edge is mapped to a super-node in the coarsened graph, and the edges of a super-node are the union of the edges of its two endpoints. Two maps keep track of the process.
[Figure: a six-node graph with matching {1-4, 2-3, 5-6}, mapped to super-nodes A, B, C.]
Super-node: A, B, C
Map1: 1, 2, 5
Map2: 4, 3, 6
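One coarsening step can be sketched as follows. The slide does not say how the matching is chosen, so this sketch assumes a simple greedy rule (each unmatched vertex pairs with its first unmatched neighbor) and assumes that a vertex left without a partner maps to itself in both maps. The function name `coarsen`, the 0-indexed vertex labels, and the example graph are hypothetical.

```python
def coarsen(adj):
    """Return (coarse_adj, map1, map2) for one greedy matching step.

    map1[s] and map2[s] are the two original vertices merged into
    super-node s; they are equal when the vertex stayed unmatched.
    """
    n = len(adj)
    matched = [False] * n
    map1, map2 = [], []
    for u in range(n):
        if matched[u]:
            continue
        # First unmatched neighbor of u, or u itself if none is left.
        partner = next(
            (v for v in range(n) if v != u and adj[u][v] > 0 and not matched[v]),
            u,
        )
        matched[u] = matched[partner] = True
        map1.append(u)
        map2.append(partner)

    # Look up the super-node of every original vertex.
    super_of = {}
    for s, (a, b) in enumerate(zip(map1, map2)):
        super_of[a] = s
        super_of[b] = s

    # Union of the original edges: weights between super-nodes are summed,
    # and edges internal to a super-node are dropped.
    k = len(map1)
    coarse = [[0.0] * k for _ in range(k)]
    for u in range(n):
        for v in range(n):
            su, sv = super_of[u], super_of[v]
            if su != sv:
                coarse[su][sv] += adj[u][v]
    return coarse, map1, map2


# Hypothetical six-node graph with edges 0-3, 1-2, 1-3, 2-5, 3-4, 4-5.
graph = [
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
]
coarse, map1, map2 = coarsen(graph)
print(map1, map2)
print(coarse)
```

On this example the greedy rule pairs 0-3, 1-2 and 4-5, which mirrors the slide's matching once the labels are shifted to start at 0.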

11 Project flow

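12 Evaluation criteria
The normalized cut of a cluster C in the graph G is defined as:
$\mathrm{Ncut}(C) = \dfrac{\sum_{v_i \in C,\, v_j \notin C} A(i,j)}{\sum_{v_i \in C} \mathrm{degree}(v_i)}$
Average Ncut over the k clusters $C_1, \dots, C_k$:
$\mathrm{Avg.\ Ncut} = \dfrac{1}{k} \sum_{i=1}^{k} \mathrm{Ncut}(C_i)$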
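The metric can be computed directly from an adjacency matrix. The sketch below assumes the standard definition of normalized cut (weight of edges leaving the cluster divided by the total degree of the cluster's vertices) and averages it over the k clusters; the function names and the example graph are hypothetical.

```python
def ncut(adj, cluster):
    """Normalized cut of one cluster: cut weight / total degree of its vertices."""
    inside = set(cluster)
    n = len(adj)
    cut = sum(adj[i][j] for i in inside for j in range(n) if j not in inside)
    volume = sum(sum(adj[i]) for i in inside)
    return cut / volume if volume else 0.0


def average_ncut(adj, clusters):
    """Mean normalized cut over all k clusters (lower is better)."""
    return sum(ncut(adj, c) for c in clusters) / len(clusters)


# Hypothetical example: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
barbell = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Each triangle cuts one edge and has total degree 2 + 2 + 3 = 7.
print(average_ncut(barbell, [[0, 1, 2], [3, 4, 5]]))
```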

13 Comparison with MCL
Why is R-MCL much faster than MCL? Regularize is faster than Expand because M_G is sparser than M, and R-MCL can stop earlier.
MLR-MCL appears to improve performance only slightly, especially the average N-cut.

14 Comparison with Graclus and Metis
Quality: MLR-MCL improves upon both Graclus and Metis.

15 Comparison with Graclus and Metis
Speed: MLR-MCL is faster than Graclus and competitive with Metis.

16 Evaluation on PPI networks
Yeast PPI network with 4741 proteins and 15148 interactions. Annotations from the Gene Ontology database are used as ground truth. MLR-MCL returns clusters of higher biological significance than MCL or Graclus.

17 Conclusions
- Regularized MCL overcomes the fragmentation problem of MCL.
- Multi-level Regularized MCL further improves the quality and speed of R-MCL.
- MLR-MCL often outperforms state-of-the-art algorithms, in both quality and speed, on a wide variety of real datasets.
Future directions:
- Novel coarsening strategies.
- Extensions to directed and bipartite graphs.
Acknowledgements: This work is supported in part by the following grants: NSF CAREER IIS-0347662, RI-CNS-0403342, CCF-0702586 and IIS-0742999.

18 Thank You!
References:
1. MCL - Graph Clustering by Flow Simulation. S. van Dongen, Ph.D. thesis, University of Utrecht, 2000.
2. Graclus - Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. Dhillon et al., IEEE Trans. PAMI, 2007.
3. Metis - A fast and high quality multilevel scheme for partitioning irregular graphs. Karypis and Kumar, SIAM J. on Scientific Computing, 1998.
4. Normalized Cuts and Image Segmentation. Shi and Malik, IEEE Trans. PAMI, 2000.
5. Finding and evaluating community structure in networks. Newman and Girvan, Phys. Rev. E 69, 2004.
6. The identification of functional modules from the genomic association of genes. Snel et al., PNAS, 2002.

