Overcoming Resolution Limits in MDL Community Detection


1 Overcoming Resolution Limits in MDL Community Detection
L. Karl Branting, The MITRE Corporation

2 Outline
- Utility functions in community detection
- Resolution limits
- MDL-based community detection
  - Previous: RB and AP
  - New: SGE
- Experimental evaluation
- Lessons

3 Utility functions in community detection
Two components of community detection algorithms
- Utility function – quality criterion to be optimized
- Search strategy – procedure for finding the optimal partition
Examples
- Girvan & Newman (2003)
  - Utility function: modularity
  - Search strategy: greedy divisive hierarchical clustering (iteratively remove the highest-betweenness edge)
- Newman (2003)
  - Search strategy: greedy agglomerative hierarchical clustering (iteratively choose the highest-modularity merge)
- Tasgin & Bingol (2006)
  - Search strategy: genetic algorithm
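The split between utility function and search strategy can be sketched in a few lines. The code below is a minimal illustration, not any of the cited authors' implementations: `greedy_agglomerative` is a hypothetical name for a brute-force agglomerative search that accepts any utility with the signature `utility(edges, groups)`, and `modularity` uses the standard published definition for simple undirected graphs.

```python
from itertools import combinations


def modularity(edges, groups):
    """Standard modularity of a partition, given as a list of node sets."""
    m = len(edges)
    q = 0.0
    for g in groups:
        internal = sum(1 for u, v in edges if u in g and v in g)
        degree_sum = sum((u in g) + (v in g) for u, v in edges)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


def greedy_agglomerative(edges, utility):
    """Start from singletons and repeatedly apply the best-scoring merge.

    Returns the highest-utility partition seen along the merge sequence.
    Brute force: every candidate merge is rescored from scratch.
    """
    nodes = sorted({n for e in edges for n in e})
    groups = [frozenset([n]) for n in nodes]
    best, best_score = groups, utility(edges, groups)
    while len(groups) > 1:
        candidates = []
        for a, b in combinations(groups, 2):
            merged = [g for g in groups if g not in (a, b)] + [a | b]
            candidates.append((utility(edges, merged), merged))
        score, groups = max(candidates, key=lambda c: c[0])
        if score > best_score:
            best, best_score = groups, score
    return best, best_score


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best, score = greedy_agglomerative(edges, modularity)
print(sorted(sorted(g) for g in best), round(score, 3))
```

Swapping in a description-length utility would only require negating it, since this search maximizes.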

4 Utility functions in community detection
Other search strategies used with modularity
- Rattigan, Maier, Jensen (2007)
  - Utility function: modularity
  - Search strategy: greedy divisive hierarchical clustering using a Network Structured Index to approximate edge betweenness
- Donetti & Munoz (2004)
  - Search strategy: greedy agglomerative hierarchical clustering with spectral division

5 Utility functions in community detection
Statistical approaches
- Zhang, Qiu, Giles, Foley, & Yen (2007)
  - Utility function: log-likelihood (LDA parameters)
  - Search strategy: fixed-point iteration
Compression-based approaches
- Rosvall & Bergstrom (2007)
  - Utility function: Minimum Description Length
  - Search strategy: simulated annealing
- Chakrabarti (2004)
  - Search strategy: exhaustive search for k, hill-climbing given k
Utility function implicit in search strategy
- Raghavan, Albert, & Kumara (2007) – marker passing
- Cliques, cores, etc.

6 Modularity
Symbols
- W(Dii) = number of edges internal to group i
- li = number of edges incident to vertices in group i
- l = total number of edges
Properties
- Intuitive – expresses the intuition that the ratio of internal to external edges is greater for groups than for non-groups
- Popular
- Imperfect
  - Fortunato & Barthelemy (2007): resolution limit; groups are conflated if their number of vertices falls below a threshold that grows with the size of the whole graph
  - Rosvall & Bergstrom (2007): biased towards same-sized groups
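For reference, modularity in its commonly published form is given below. This is a sketch in standard notation, not a recovery of the slide's own formula. The symbols are assumptions: e_i plays the role of W(Dii), m plays the role of l, and d_i is the sum of the degrees of the vertices in group i, which may or may not be how the slide counts li.

```latex
Q = \sum_{i} \left[ \frac{e_i}{m} - \left( \frac{d_i}{2m} \right)^{2} \right]
```

The first term rewards edges inside a group; the second subtracts the share expected if edges were placed at random with the same degrees.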

7 Resolution Limit
Ring graph R15,4
- 15 communities
- 4 nodes per community
Community structure that maximizes modularity conflates groups
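The conflation on R15,4 can be checked by hand. The sketch below assumes each community is a 4-clique and neighbouring cliques are joined by a single edge (the slides do not spell out the construction), and it uses the standard modularity definition. `ring_modularity` is a hypothetical helper that scores a partition made of runs of adjacent cliques.

```python
def ring_modularity(m, c, block_sizes):
    """Modularity of a partition of a ring of m c-cliques.

    block_sizes[k] is the number of consecutive cliques merged into group k.
    Assumes adjacent cliques are linked by exactly one edge.
    """
    assert sum(block_sizes) == m and all(1 <= k < m for k in block_sizes)
    clique_edges = c * (c - 1) // 2
    total_edges = m * (clique_edges + 1)          # one bridge per clique
    q = 0.0
    for k in block_sizes:
        internal = k * clique_edges + (k - 1)     # cliques plus bridges inside the run
        degree_sum = k * (2 * clique_edges + 2)   # each clique touches two bridges
        q += internal / total_edges - (degree_sum / (2 * total_edges)) ** 2
    return q


separate = ring_modularity(15, 4, [1] * 15)       # the 15 intended communities
paired = ring_modularity(15, 4, [2] * 7 + [1])    # neighbours merged two by two
print(round(separate, 4), round(paired, 4))
```

Under these assumptions the paired partition scores slightly higher (about 0.795 versus 0.790), so a modularity maximizer prefers the conflated structure.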

8 Approaches to modularity’s resolution limit
- Apply recursively to large communities (Ruan & Zhang 2007)
- Apply locally (Clauset 2005)
- Choose a different utility function

9 Description Length
- Utility of a community structure is the sum of the bits needed to represent
  - Community structure
  - Graph given community structure
- Search strategy attempts to minimize description length
- There is no unique bit count
  - Undecidability of Kolmogorov complexity
- Previous approaches
  - Rosvall & Bergstrom (2007): RB
    - Handles group size skew better than modularity
  - Chakrabarti (2004): AP
- Comparison
  - Similar breakdown of bits
  - Different calculation

10 Components of Description
Components (details in paper)
1. Bits to represent number of nodes in graph (ignored because not specific to community structure)
2. Bits to represent number of groups
3. Bits to represent mapping between nodes and groups
4. Bits needed for number of group-to-group edges
5. Bits needed for adjacencies between nodes
Purpose
- 2, 3, 4: represent group structure
- 1, 5: represent graph as a whole

11 Surprising Experimental Result
- RB, AP, and modularity compared as utility functions
- Applied to ring graphs Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9
- Search strategy: greedy divisive hierarchical clustering (iteratively remove the highest-betweenness edge)
- Unsurprising result: modularity led to conflated groups for
  - m > 8 and c = 3
  - m > 10 and c = 4
  - m > 11 and c = 5
  - m > 13 and c = 6, 7
- Surprising result: both RB and AP conflated at least one pair of groups in every Rm,c!

12 Hypothesis
- Both RB and AP require at least one bit per pair of groups in term 4
  - Perhaps this estimation causes group conflation
- Term 4 grows as the square of the number of groups
- If the graph is sparse, conflating groups may save more in term 4 reduction than it costs in term 5 increase
Components
1. Bits to represent number of nodes in graph (ignored because not specific to community structure)
2. Bits to represent number of groups
3. Bits to represent mapping between nodes and groups
4. Bits needed for number of group-to-group edges
5. Bits needed for adjacencies between nodes
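A quick count shows how fast the term 4 floor grows. The sketch below assumes "pair of groups" means an unordered pair of distinct groups and charges exactly one bit per pair, which is only the lower bound named in the hypothesis, not the actual RB or AP formula. `min_term4_bits` is a hypothetical helper.

```python
def min_term4_bits(num_groups):
    """Lower bound on term 4 when every unordered pair of groups costs one bit."""
    return num_groups * (num_groups - 1) // 2


rows = []
for m in (4, 8, 16):
    before = min_term4_bits(m)
    after = min_term4_bits((m + 1) // 2)   # neighbouring groups conflated two by two
    rows.append((m, before, after, before - after))
    print(f"m={m}: {before} bits before, {after} after, {before - after} saved")
```

Halving the number of groups removes roughly three quarters of this floor, and in a sparse graph that saving can outweigh the extra cost of encoding adjacencies inside the larger merged groups.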

13 SGE (Sparse Graph Encoding)
Components
1. Bits to represent number of nodes in graph
   - Ignored, as in RB and AP
2. Bits to represent number of groups
   - Follows RB
3. Bits to represent mapping between nodes and groups
   - Similar to AP
4. Bits needed for number of group-to-group edges
   - Split into 2 terms
     - Which pairs of groups are connected (much less than one bit per pair if pairs are sparsely or densely connected)
     - Number of edges between connected groups (grows as the number of connected pairs, not the total number of pairs)
5. Bits needed for adjacencies between nodes
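One way to see why naming the connected pairs can cost less than one bit per pair is to count subsets. The sketch below is an assumption for illustration, not the formula from the paper: it charges log2 of the number of ways to choose k connected pairs out of all unordered group pairs. `connected_pairs_bits` is a hypothetical name.

```python
from math import comb, log2


def connected_pairs_bits(num_groups, num_connected):
    """Bits to identify which unordered group pairs are connected,
    by indexing into all subsets of that size."""
    total_pairs = comb(num_groups, 2)
    return log2(comb(total_pairs, num_connected))


m = 15
total = comb(m, 2)                               # 105 unordered pairs of groups
sparse = connected_pairs_bits(m, m)              # ring: m connected pairs
dense = connected_pairs_bits(m, total - m)       # all but m pairs connected
half = connected_pairs_bits(m, total // 2)       # about half the pairs connected
print(total, round(sparse, 1), round(dense, 1), round(half, 1))
```

The sparse and dense cases cost the same by symmetry and both fall well under one bit per pair, while the half-connected case approaches it.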

14 Performance of SGE on Ring Graphs
- Correct community structure found for every Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9 except R4,3 and R13,3
- Results confirm the hypothesis that the resolution limit in RB and AP is the result of over-counting term 4: the bits needed for group-to-group edges
Significance
- Ring graphs are rare in the real world
- How does SGE compare on more realistic graphs?

15 Uniform random graph
Similar to graphs in Rosvall & Bergstrom (2007)
Test set
- 32 vertices
- 4 groups
- Average degree 6
- Size ratio: {1.0, 1.25, 1.5, 1.75, 2.0}
- Proportion of internal edges: {0.6, 0.75, 0.9}
Example: size ratio 1.25, proportion of internal edges 0.67
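A test graph with these parameters could be generated along the following lines. This is a sketch under assumptions, not the procedure used in the experiments: group sizes are passed in explicitly because the slides do not say how sizes follow from the size ratio, and the internal and external edges are sampled uniformly without replacement. `planted_partition` is a hypothetical name.

```python
import random
from itertools import combinations


def planted_partition(sizes, avg_degree, frac_internal, seed=0):
    """Sample a graph with a fixed share of within-group edges.

    sizes: vertices per group (assumed input, see lead-in).
    Returns (labels, edges) where labels[v] is the group of vertex v.
    """
    rng = random.Random(seed)
    labels = [g for g, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    num_edges = round(n * avg_degree / 2)
    num_internal = round(frac_internal * num_edges)
    pairs = list(combinations(range(n), 2))
    internal = [p for p in pairs if labels[p[0]] == labels[p[1]]]
    external = [p for p in pairs if labels[p[0]] != labels[p[1]]]
    edges = rng.sample(internal, num_internal)
    edges += rng.sample(external, num_edges - num_internal)
    return labels, edges


labels, edges = planted_partition([8, 8, 8, 8], avg_degree=6, frac_internal=0.75)
within = sum(1 for u, v in edges if labels[u] == labels[v])
print(len(labels), len(edges), within)
```

With 32 vertices and average degree 6 the graph has 96 edges, so a proportion of 0.75 places 72 of them inside groups.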

16 Embedded Barabasi-Albert Graphs
Test set
- 4 communities separately generated by preferential attachment
- In each community
  - 4 initial vertices
  - 2-4 edges added per time step
  - 20 time steps
Example
- 4 communities
- 3 edges added per time step
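Generating one such community by preferential attachment might look like the sketch below. Several details are assumptions because the slides do not state them: the 4 initial vertices start as a clique, each time step adds one new vertex, and that vertex attaches to distinct existing vertices chosen in proportion to their degree. How the four communities are then embedded in one graph is not covered here. `preferential_attachment` is a hypothetical name.

```python
import random


def preferential_attachment(initial=4, edges_per_step=3, steps=20, seed=0):
    """Grow one community; requires edges_per_step <= initial."""
    rng = random.Random(seed)
    # Assumption: the initial vertices form a clique, so no vertex has degree zero.
    edges = [(u, v) for u in range(initial) for v in range(u + 1, initial)]
    # A vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = [n for e in edges for n in e]
    for new in range(initial, initial + steps):
        targets = set()
        while len(targets) < edges_per_step:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((t, new))
            endpoints.extend((t, new))
    return edges


community = preferential_attachment()
vertices = {n for e in community for n in e}
print(len(vertices), len(community))
```

With the example setting of 3 edges per step this yields 24 vertices and 66 edges per community (6 initial plus 60 added).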

17 Evaluation Criteria
- Rand index (Rand 1971)
- Adjusted Rand index (Hubert & Arabie 1985)
- F-measure – based on same-cluster pairs, combining pairwise recall and precision
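The pair-based criteria can be sketched as follows. The slide's own formulas are not available, so the definitions here are the standard ones and should be read as assumptions: precision and recall are computed over vertex pairs placed in the same cluster, and the F-measure is taken as their harmonic mean. The adjusted Rand index is omitted for brevity.

```python
from itertools import combinations


def pair_counts(truth, predicted):
    """Classify every vertex pair by whether each labeling co-clusters it."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_true = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(truth, predicted):
    tp, fp, fn, tn = pair_counts(truth, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


def pairwise_f_measure(truth, predicted):
    tp, fp, fn, _ = pair_counts(truth, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


truth = [0, 0, 0, 1, 1, 1]
conflated = [0, 0, 0, 0, 0, 0]        # both groups merged into one
ri = rand_index(truth, conflated)
precision, recall, f = pairwise_f_measure(truth, conflated)
print(ri, precision, recall, round(f, 3))
```

Conflating two groups keeps recall perfect but lowers precision, which is the failure mode the ring-graph experiments probe.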

18 Results: Uniform random graph

19 Results: Uniform random graph

20 Results: Uniform random graph

21 Results: Embedded Barabasi-Albert

22 Summary of Evaluation
Random graphs
- Community structure is weak
  - Group sizes are balanced – modularity is best
  - Group sizes are imbalanced – RB is best (as per Rosvall & Bergstrom 2007)
- Community structure is strong
  - Group sizes are balanced – not much difference
  - Group sizes are imbalanced – modularity is particularly bad (as per Rosvall & Bergstrom 2007), SGE slightly better than RB and AP
EBA graphs
- Sparse – AP and SGE weaker than modularity and RB
- Dense – essentially identical accuracy

23 Conclusion
Narrow
- Conflation of groups by MDL in sparse graphs (e.g., ring graphs) can be avoided by adjusting group-to-group edge counts.
- This change doesn’t hurt performance in more common types of graphs.
- Compression-based clustering works well, but requires tinkering
- Modularity detects weak structure well when the graph is not too big and groups are not too imbalanced
Broad
- Still unclear what utility function is best overall
- Needed: theory relating graph typology to utility functions

