Community detection algorithms: a comparative analysis Santo Fortunato
Communities: more links “inside” than “outside”. Graphs are “sparse”.
Metabolic, protein-protein, social, economic
Problems: confusion about the main concepts (community, partition, null models); (too) many algorithms around; how shall we test them?
Testing a method means applying it to graphs with known community structure (benchmarks). Benchmarks are thus based on an implicit definition of community. Ideally, algorithms should be based on the same definition/principle; otherwise there is an inconsistency.
The planted l-partition model (Condon & Karp, 1999): n nodes, l equal-sized groups with g = n/l nodes each; p = probability that two nodes in the same group are connected; q = probability that two nodes in different groups are connected. If p > q, communities are there!
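The model above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `planted_partition` and the choice of assigning consecutive node labels to the same group are assumptions made for this example, not part of the original model description.

```python
import itertools
import random


def planted_partition(n, l, p, q, seed=None):
    """Return (edges, groups) for a planted l-partition graph.

    n nodes are split into l equal-sized groups of g = n // l nodes.
    Two nodes are linked with probability p if they share a group,
    and with probability q otherwise.
    """
    if n % l != 0:
        raise ValueError("n must be divisible by l so that groups have equal size")
    rng = random.Random(seed)
    g = n // l
    # Hypothetical labelling: nodes 0..g-1 form group 0, g..2g-1 group 1, etc.
    groups = [node // g for node in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        prob = p if groups[u] == groups[v] else q
        if rng.random() < prob:
            edges.append((u, v))
    return edges, groups


if __name__ == "__main__":
    edges, groups = planted_partition(n=128, l=4, p=0.4, q=0.05, seed=1)
    internal = sum(groups[u] == groups[v] for u, v in edges)
    print("internal links:", internal, "external links:", len(edges) - internal)
```

With p well above q, most links fall inside the groups, which is the sense in which "communities are there".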
Benchmark of Girvan & Newman: 128 nodes, 4 groups, average degree 16. All nodes have the same degree (in expectation). Special case of the planted l-partition model, with n = 128, l = 4, g = 32.
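As a worked example of how this benchmark maps onto the planted l-partition model: each node has g - 1 = 31 possible partners inside its group and n - g = 96 outside, so fixing the expected external degree fixes both p and q. The sketch below assumes a parameter named `z_out` for the expected external degree; that name is introduced here for illustration.

```python
def gn_probabilities(z_out, n=128, l=4, avg_degree=16):
    """Return (p, q) of the planted l-partition model for a given
    expected external degree z_out (hypothetical parameter name)."""
    g = n // l
    z_in = avg_degree - z_out      # expected internal degree
    p = z_in / (g - 1)             # 31 possible partners inside the group
    q = z_out / (n - g)            # 96 possible partners outside the group
    return p, q


if __name__ == "__main__":
    for z_out in (4, 8, 12, 13):
        p, q = gn_probabilities(z_out)
        print(z_out, round(p, 4), round(q, 4), "p > q" if p > q else "p <= q")
```

Under these assumptions the condition p > q holds as long as z_out stays below 1536/127, roughly 12.1, so communities are formally present up to about twelve external links per node.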
Problems with the GN benchmark: all nodes have the same degree; all communities have equal size. In real networks the distributions of degree and community size are highly heterogeneous!
New benchmark (A. Lancichinetti, S. F., F. Radicchi, Phys. Rev. E 78, , 2008): power-law distribution of degree; power-law distribution of community size; a mixing parameter μ_t sets the ratio between the external and the total degree of each node. The benchmark can be extended to directed and weighted networks with overlapping communities (A. Lancichinetti, S. F., Phys. Rev. E 80, , 2009). The software to produce all new benchmarks is here:
Algorithm (input: n nodes, average degree). 1) Each node is given a degree from a power-law distribution with exponent τ_1. 2) Community sizes are taken from a power-law distribution with exponent τ_2. 3) Nodes are initially homeless; each node is assigned to a community, taken at random, such that the community size s is larger than the node degree k (s > k); if the community is complete, a random node of it is kicked out. The procedure continues until all nodes are assigned to communities. 4) A graph is built with the configuration model, such that the degree of each node is the internal community degree k_int = (1-μ_t)k and there are only internal links, so communities are initially disconnected.
Finally, the links between communities are added. This is done by superimposing on the existing graph another graph whose nodes have degrees k_ext = μ_t k, built with the configuration model. The links of this new graph which end up within communities are eliminated with a rewiring procedure.
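The first and last ingredients of the procedure, drawing degrees from a power law and splitting each degree into an internal and an external part, can be sketched as follows. This is not the benchmark software itself: the function names, the degree bounds `lo` and `hi`, and the rounding rule are assumptions made for illustration, and the community assignment and configuration-model wiring are left out.

```python
import random


def sample_power_law(count, exponent, lo, hi, rng):
    """Draw `count` integers in [lo, hi] with probability proportional
    to value ** -exponent (a truncated discrete power law)."""
    values = list(range(lo, hi + 1))
    weights = [v ** -exponent for v in values]
    return rng.choices(values, weights=weights, k=count)


def split_degrees(degrees, mu_t):
    """Split each degree k into (k_int, k_ext) with k_ext close to mu_t * k.

    Rounding is an assumption of this sketch; it keeps k_int + k_ext == k.
    """
    split = []
    for k in degrees:
        k_ext = round(mu_t * k)
        split.append((k - k_ext, k_ext))
    return split


if __name__ == "__main__":
    rng = random.Random(42)
    degrees = sample_power_law(count=1000, exponent=2.0, lo=5, hi=50, rng=rng)
    pairs = split_degrees(degrees, mu_t=0.3)
    external = sum(k_ext for _, k_ext in pairs)
    print("fraction of external link ends:", external / sum(degrees))
```

The printed fraction should be close to the chosen μ_t, which is what the mixing parameter is meant to control.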
Computer time
The benchmark can be extended to directed and weighted networks with overlapping communities (A. Lancichinetti, S. F., Phys. Rev. E 80, , 2009). Directed networks: the process is reformulated for the indegree; for the outdegree we choose a δ-distribution. Weighted networks: one has to specify two other parameters, an exponent β for the relation s ~ k^β between the strength s and the degree k of a node, and the weighted mixing parameter μ_w; first one builds the network, then one assigns the weights by minimizing a cost function. Overlaps: a bipartite network is built with the configuration model to assign each node to one or more communities. For overlapping communities see also Sawardecker et al. (EPJB 67, 277, 2009).
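Comparing partitions: normalized mutual information. x_i, y_i: community assignments of node i in partitions X and Y. P(X=x) = n_x/n, P(Y=y) = n_y/n. Joint distribution: P(X=x, Y=y) = n_xy/n. Shannon entropy of X: $H(X) = -\sum_x P(x)\log P(x)$. Shannon conditional entropy of X given Y: $H(X|Y) = -\sum_{x,y} P(x,y)\log P(x|y)$.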
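Mutual information: $I(X,Y) = H(X) - H(X|Y)$, with H(X) the Shannon entropy of X and H(X|Y) the conditional entropy of X given Y. Problem: the mutual information is identical for all Y which are subpartitions of X, because then H(X|Y) = 0. To avoid that: normalized mutual information, $I_{norm}(X,Y) = \frac{2\,I(X,Y)}{H(X)+H(Y)}$.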
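The measure can be computed directly from two lists of community labels. The sketch below assumes the standard definitions H(X) = -Σ P(x) log P(x), I(X,Y) = Σ P(x,y) log[P(x,y)/(P(x)P(y))] and the normalization 2I/(H(X)+H(Y)); the function names are chosen for this example, and returning 1 when both partitions have a single community is a convention of this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a partition given as a list of community labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X, Y) computed from the counts n_x, n_y and n_xy."""
    n = len(x)
    nx, ny, nxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (nx[a] * ny[b])) for (a, b), c in nxy.items()
    )


def normalized_mutual_information(x, y):
    """2 I(X, Y) / (H(X) + H(Y)); equals 1 only for identical partitions."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # convention: both partitions are a single community
    return 2 * mutual_information(x, y) / (hx + hy)


if __name__ == "__main__":
    x = [0, 0, 0, 0, 1, 1, 1, 1]
    halves = [0, 0, 1, 1, 2, 2, 3, 3]   # a subpartition of x
    singletons = list(range(8))         # another subpartition of x
    for y in (x, halves, singletons):
        print(mutual_information(x, y), normalized_mutual_information(x, y))
```

In this example the plain mutual information is the same (log 2) for all three choices of Y, while the normalized value drops from 1 to about 0.67 and then to 0.5, which illustrates why the normalization is needed.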
What is the best algorithm? A comparative analysis (A. Lancichinetti, S.F., Phys. Rev. E 80, , 2009)
Divisive algorithms. Principle: one removes the links that connect the clusters, until the latter are isolated. How to identify intercommunity links? 1) Edge betweenness (M. Girvan & M. E. J. Newman, PNAS 99, , 2002); 2) Edge clustering coefficient (F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, PNAS 101, 2658, 2004).
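To illustrate the second criterion, the sketch below scores each link by an edge clustering coefficient, assuming the triangle-based definition C_ij = (z_ij + 1) / min(k_i - 1, k_j - 1), where z_ij is the number of triangles containing the link. Links between communities belong to few triangles and therefore score low. The function name and the toy graph are chosen for this example; a full divisive algorithm would repeatedly remove the lowest-scoring link and recompute the scores.

```python
import math


def edge_clustering(edges):
    """Return {edge: coefficient} for an undirected graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for u, v in edges:
        triangles = len(adj[u] & adj[v])                 # common neighbours of u and v
        max_triangles = min(len(adj[u]) - 1, len(adj[v]) - 1)
        # Links attached to a degree-1 node cannot be in any triangle.
        scores[(u, v)] = (triangles + 1) / max_triangles if max_triangles else math.inf
    return scores


if __name__ == "__main__":
    # Two triangles joined by the single link (2, 3).
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    scores = edge_clustering(edges)
    print(min(scores, key=scores.get))
```

On this toy graph the bridge between the two triangles gets the lowest score and would be the first link removed.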
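Modularity: $Q = \sum_i \left[\frac{l_i}{m} - \left(\frac{d_i}{2m}\right)^2\right]$, where m is the total number of links, $l_i$ = # links in module i, and $d_i^2/4m$ = expected # of links in module i ($d_i$ is the total degree of the nodes in module i). Newman & Girvan, Phys. Rev. E 69, , 2004.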
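A small numerical check helps to make the definition concrete. The sketch assumes the standard Newman-Girvan form Q = Σ_i [l_i/m - (d_i/2m)²], with m the total number of links, l_i the number of links inside module i and d_i the total degree of its nodes; the function name and the toy graph are chosen for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> module label.
    """
    m = len(edges)
    internal = defaultdict(int)   # l_i: links with both ends in module i
    degree = defaultdict(int)     # d_i: sum of the degrees of nodes in module i
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


if __name__ == "__main__":
    # Two triangles joined by one link.
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    one_module = {node: "a" for node in range(6)}
    print(modularity(edges, two_modules), modularity(edges, one_module))
```

For the two-triangle partition the value is 5/14 (about 0.357), while putting all nodes in one module gives 0, since the observed and expected numbers of internal links then coincide.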
Infomap (Rosvall & Bergstrom, PNAS 105, 1118, 2008). The best partition is the one with the minimum description length; the optimization can be carried out with simulated annealing, greedy methods, etc.
Clique Percolation Method (Palla, Derényi, Farkas & Vicsek, Nature 435, 814, 2005). Principle: in a graph with community structure there are many cliques within the clusters. Cliques can be used as probes to explore the graph: 1) two k-cliques are neighbors if they share a (k-1)-clique; 2) one can travel along paths of neighboring cliques. Cliques may be trapped within clusters, which can then be identified.
Clique percolation method
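The clique percolation idea can be sketched for small graphs as follows: list all k-cliques, link two of them when they share k - 1 nodes, and take the union of the nodes in each connected group of cliques as a community. The brute-force clique enumeration and the function name are simplifications assumed for this example; they are only practical for toy graphs with sortable node labels.

```python
import itertools


def k_clique_communities(edges, k):
    """Return the k-clique communities as a sorted list of sorted node lists."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Brute-force enumeration of all k-cliques (fine for small graphs only).
    cliques = [
        frozenset(c)
        for c in itertools.combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in itertools.combinations(c, 2))
    ]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two k-cliques are neighbors if they share k - 1 nodes.
    for i, j in itertools.combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    merged = {}
    for i, clique in enumerate(cliques):
        merged.setdefault(find(i), set()).update(clique)
    return sorted(sorted(nodes) for nodes in merged.values())


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),   # two adjacent triangles
             (3, 4),                                   # bridge
             (4, 5), (4, 6), (5, 6)]                   # a separate triangle
    print(k_clique_communities(edges, k=3))
```

In the toy graph the two triangles sharing the link (1, 2) merge into one community, while the triangle beyond the bridge stays separate because no path of neighboring 3-cliques crosses the bridge.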
Tests on GN benchmark
Tests on LFR benchmark (undirected, unweighted)
Tests on LFR benchmark (directed, unweighted)
Tests on LFR benchmark (undirected, weighted)
Tests on random graphs
Outlook. New benchmark graphs based on the planted l-partition model (true community definition?): weighted/unweighted, directed/undirected, and with overlapping communities. Comparative analysis of existing methods on the new benchmarks: the method by Rosvall and Bergstrom (PNAS 105, 1118, 2008) is the best; it is very good on the new benchmarks, it recognizes random graphs if the average degree is not too small, and it is fast as well! Warning: the benchmarks are characterized by “flat” clustering, with no hierarchy, and by a low clustering coefficient too (work in progress). Crucial issue for the future: a proper definition of hierarchical community structure and the related testing! Agreement on how to test algorithms is more crucial than designing algorithms!
S. F., arXiv: , Physics Reports 486, (2010)