Published by David Carr. Modified over 9 years ago.
1
Community Structures
2
What Is Community Structure?
My T. Thai mythai@cise.ufl.edu
Definition: A community is a group of nodes in which there are more edges (interactions) between nodes within the group than to nodes outside of it.
3
Why Community Structure (CS)?
Many systems can be expressed as a network, in which nodes represent objects and edges represent the relations between them:
- Social networks: collaboration networks, online social networks
- Technological networks: IP address networks, the WWW, software dependencies
- Biological networks: protein interaction networks, metabolic networks, gene regulatory networks
4
Why CS? Yeast protein interaction networks
5
Why CS? IP address network
6
Why Community Structure?
Nodes in a community have some common properties.
Communities represent some properties of a network.
Examples:
- In social networks, communities represent social groupings based on interest or background.
- In citation networks, they represent related papers on one topic.
- In metabolic networks, they represent cycles and other functional groupings.
7
An Overview of Recent Work
- Disjoint CS
- Overlapping CS
- Centralized approach: define the quantity of modularity and optimize it using greedy algorithms, integer programming (IP), semidefinite programming (SDP), spectral methods, random walks, or clique percolation
- Localized approach
- Handle dynamics and evolution
- Incorporate other information
8
Graph Partitioning? It’s Not
Graph partitioning algorithms are typically based on minimum-cut approaches or spectral partitioning.
9
Graph Partitioning
Minimum-cut partitioning breaks down when we don’t know the sizes of the groups:
- Optimizing the cut size with the group sizes left free puts all vertices in the same group.
Cut size is the wrong thing to optimize:
- A good division into communities is not just one with a small number of edges between groups.
- There must be a smaller-than-expected number of edges between communities.
10
Edge Betweenness
Focus on the edges that are least central, i.e., the edges that are most “between” communities.
Instead of adding edges to G = (V, ∅), progressively remove edges from the original graph G = (V, E).
11
Edge Betweenness
Definition: The edge betweenness of an edge (u,v) is the number of shortest paths between any pair of nodes in the network that run through (u,v):
$\mathrm{betweenness}(u,v) = \left| \{ P_{xy} \mid x, y \in V,\ P_{xy} \text{ is a shortest path between } x \text{ and } y,\ (u,v) \in P_{xy} \} \right|$
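The counting definition of edge betweenness can be checked directly on small graphs. The sketch below is one possible brute-force reading of it, not the method used in the slides: it assumes an undirected, unweighted graph stored as a dict of neighbor sets, and the function names are hypothetical. It enumerates every shortest path for every unordered node pair, so it is only practical for toy inputs.

```python
from collections import deque
from itertools import combinations


def shortest_paths(adj, source, target):
    """Return every shortest path from source to target as a list of nodes."""
    dist = {source: 0}
    preds = {source: []}          # predecessors on shortest paths from source
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                preds[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                preds[nbr].append(node)
    if target not in dist:
        return []                 # target unreachable from source

    def expand(node):
        # Walk predecessor lists back to the source, building all paths.
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in expand(p)]

    return expand(target)


def edge_betweenness(adj):
    """Count, for each edge, the shortest paths (over all node pairs) using it."""
    counts = {}
    for x, y in combinations(sorted(adj), 2):      # each unordered pair once
        for path in shortest_paths(adj, x, y):
            for u, v in zip(path, path[1:]):
                edge = tuple(sorted((u, v)))
                counts[edge] = counts.get(edge, 0) + 1
    return counts
```

On two triangles joined by a single bridge, the bridge carries all nine cross-triangle shortest paths, which is why it is the edge the removal procedure targets first.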
12
Why Edge Betweenness
13
Algorithm
Initialize G = (V, E) representing a network
while E is not empty:
    Calculate the betweenness of all edges in G
    Remove the edge e with the highest betweenness: G = (V, E − {e})
In fact, we only need to recalculate the betweenness of the edges affected by the removal.
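The removal loop above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the slides' implementation: the graph is an undirected dict of neighbor sets with comparable node labels, betweenness is computed with a Brandes-style accumulation in which each shortest path is weighted by one over the number of shortest paths for its pair (the usual Girvan and Newman convention, which differs from a raw path count when several shortest paths exist), and all betweenness values are recomputed after every removal instead of only the affected ones. The function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Fractional edge betweenness via Brandes-style accumulation."""
    score = {}
    for s in adj:
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:                       # BFS from s, counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):          # accumulate dependencies backwards
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = (v, w) if v < w else (w, v)
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    # every unordered pair was visited from both endpoints
    return {e: b / 2.0 for e, b in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(adj):
    """Yield the component list each time an edge removal splits the graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}    # work on a copy
    n_comps = len(components(adj))
    while any(adj.values()):                           # while E is not empty
        scores = edge_betweenness(adj)
        u, v = max(scores, key=scores.get)             # highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > n_comps:
            n_comps = len(comps)
            yield comps
```

Each yielded list is one level of the resulting hierarchy, from the first split down to isolated nodes.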
14
Time Complexity
Let |V| = n and |E| = m.
Calculating the betweenness of all edges takes $O(mn)$.
Since the betweenness must be recalculated after each of the up to m edge removals, the total is $O(m^2 n)$.
15
An Example
16
Disadvantages/Improvements
- Can we improve the time complexity?
- The communities come out in hierarchical form. Can we find disjoint communities?
17
Define a quantity (measure) of modularity Q and find an approximation algorithm to maximize Q.
18
Finding community structure in very large networks
Authors: Aaron Clauset, M. E. J. Newman, Cristopher Moore (2004)
Consider edges that fall within a community or between a community and the rest of the network.
Define modularity, where m is the number of edges and $k_v$ is the degree of vertex v:
$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)$
- $A_{vw}$: adjacency matrix
- $k_v k_w / 2m$: the probability of an edge between two vertices is proportional to their degrees
- $\delta(c_v, c_w)$: 1 if the vertices are in the same community, 0 otherwise
For a random network, Q = 0: the number of edges within a community is no different from what you would expect.
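As a worked illustration of this definition, the sketch below evaluates the standard Clauset, Newman and Moore form, Q = (1/2m) Σ_vw [A_vw − k_v k_w / 2m] δ(c_v, c_w), by direct double summation. It assumes an undirected, unweighted graph without self-loops, stored as a dict of neighbor sets; the `community_of` mapping and the function name are hypothetical.

```python
def modularity(adj, community_of):
    """Q = (1/2m) * sum_{v,w} [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())    # sum of degrees = 2m
    q = 0.0
    for v in adj:
        for w in adj:
            if community_of[v] != community_of[w]:
                continue                               # delta(c_v, c_w) = 0
            a_vw = 1.0 if w in adj[v] else 0.0         # adjacency entry
            q += a_vw - len(adj[v]) * len(adj[w]) / two_m
    return q / two_m
```

For two triangles joined by one edge, splitting at the bridge gives Q = 5/14, while putting all six nodes in one community gives Q = 0.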
19
Finding community structure in very large networks
Authors: Aaron Clauset, M. E. J. Newman, Cristopher Moore (2004)
Algorithm:
- Start with all vertices as isolates.
- Follow a greedy strategy: successively join the pair of clusters that gives the greatest increase ΔQ in modularity.
- Stop when the maximum possible ΔQ from joining any two clusters is ≤ 0.
Successfully used to find community structure in a graph with > 400,000 nodes and > 2 million edges (Amazon’s “people who bought this also bought that…” network).
Alternatives for approaching the optimum Q: simulated annealing rather than greedy search.
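The greedy joining strategy can be sketched naively as below. This is not the heap-based data structure that makes the published algorithm fast on very large networks; it rescans all community pairs at every step and is meant only to show the merge rule. It assumes an undirected, unweighted graph as a dict of neighbor sets and uses the standard gain ΔQ = 2(e_ab − a_a a_b), where e_ab is the number of edges between the two communities divided by 2m and a_x is the degree sum of x divided by 2m. The function name is hypothetical.

```python
from itertools import combinations


def greedy_modularity(adj):
    """Merge the community pair with the largest positive gain until none is left."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    communities = [{v} for v in adj]           # every vertex starts isolated

    def delta_q(a, b):
        # Gain of merging a and b: 2 * (e_ab - a_a * a_b)
        between = sum(1 for v in a for w in adj[v] if w in b)
        deg_a = sum(len(adj[v]) for v in a)
        deg_b = sum(len(adj[v]) for v in b)
        return 2.0 * (between / two_m - deg_a * deg_b / two_m ** 2)

    while len(communities) > 1:
        gain, i, j = max(
            (delta_q(communities[i], communities[j]), i, j)
            for i, j in combinations(range(len(communities)), 2)
        )
        if gain <= 0:                          # no merge improves Q: stop
            break
        communities[i] |= communities[j]       # i < j, so index i stays valid
        del communities[j]
    return communities
```

Pairs with no edge between them always have a negative gain, so in practice only adjacent communities are ever merged.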
20
Extensions to weighted networks
Betweenness clustering? It will not work: strong ties carry a disproportionate number of shortest paths, and those are the ties we want to keep.
Modularity (“Analysis of weighted networks”, M. E. J. Newman) extends to weighted edges.
Example: keywords in Reuters news articles, connected by weighted edges.
21
Structural Quality
- Coverage
- Modularity
- Conductance
- Inter-cluster conductance
- Average conductance
There is no single perfect quality function [Almeida et al. 2011].
22
Resolution Limit
- $l_s$: number of links inside module s
- $L$: number of links in the network
- $d_s$: total degree of the nodes in module s
- Expected number of links in module s (the null-model term)
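The symbols listed above match the per-module form of modularity commonly used in the resolution-limit analysis. As an assumption about what the missing equation on this slide showed, it can be written as:

```latex
Q = \sum_{s} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^{2} \right]
```

Here $l_s / L$ is the fraction of links inside module $s$, and $(d_s / 2L)^2$ is the fraction expected if links were placed at random with the same degrees, which corresponds to an expected number of $d_s^2 / 4L$ links in module $s$.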
23
The Limit of Modularity
Modularity seems to have some intrinsic scale of order $\sqrt{L}$, which constrains the number and the size of the modules. For a given total number of nodes and links we could build many more than $\sqrt{L}$ modules, but the corresponding network would be less “modular”, namely with a value of the modularity lower than the maximum.
24
The Resolution Limit
Since $M_1$ and $M_2$ are constructed modules, we have [equation not recovered]
25
The Resolution Limit (cont.)
Let’s consider the following case:
- $Q_A$: $M_1$ and $M_2$ are separate modules
- $Q_B$: $M_1$ and $M_2$ form a single module
Since both $M_1$ and $M_2$ are modules by construction, we need [equation not recovered]. That is, [equation not recovered].
26
The Resolution Limit (cont.)
Now let’s see how this contradicts the constructed modules $M_1$ and $M_2$. We consider the following two scenarios:
1. The two modules have a perfect balance between internal and external degree ($a_1 + b_1 = 2$, $a_2 + b_2 = 2$), so they are on the edge between being and not being communities, in the weak sense.
2. The two modules have the smallest possible external degree, which means that there is a single link connecting them to the rest of the network and only one link connecting them to each other ($a_1 = a_2 = b_1 = b_2 = 1/l$).
27
Scenario 1 (cont.)
When [...] and [...], the right side of [...] can reach the maximum value [...]. In this case, [...] may happen.
28
Scenario 2 (cont.)
$a_1 = a_2 = b_1 = b_2 = 1/l$
29
Schematic Examples (cont.)
For example, p = 5, m = 20. The maximal modularity of the network corresponds to the partition in which the two smaller cliques are merged.
30
Fix the resolution? Uncover communities of different sizes.
31
Community Detection Algorithms
- Blondel (Louvain method) [Blondel et al. 2008]: Fast Modularity Optimization; hierarchical clustering
- Infomap [Rosvall & Bergstrom 2008]: Maps of Random Walks; flow-based and information-theoretic
- InfoH (InfoHiermap) [Rosvall & Bergstrom 2011]: Multilevel Compression of Random Walks; hierarchical version of Infomap
32
Community Detection Algorithms
- RN [Ronhovde & Nussinov 2009]: Potts Model Community Detection; minimization of the Hamiltonian of a Potts model spin system
- MCL [Dongen 2000]: Markov Clustering; random walks stay longer in dense clusters
- LC [Ahn et al. 2010]: Link Community Detection; a community is redefined as a set of closely interrelated edges; overlapping and hierarchical clustering
33
Blondel et al.
Two phases:
Phase 1:
- Initially, we have n communities (each node is its own community).
- For each node i, consider each neighbor j of i and evaluate the modularity gain that would result from placing i in the community of j.
- Node i is placed in the community for which this gain is maximum (and positive).
- Stop this process when no further improvement can be achieved.
Phase 2:
- Compress each community into a node, thus constructing a new graph that represents the community structure after Phase 1.
- Re-apply Phase 1.
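Phase 1 (local moving) can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the graph is undirected and unweighted without self-loops, stored as a dict of neighbor sets; the gain of inserting an isolated node i with degree k_i into community c is taken as k_in/m − Σ_tot·k_i/(2m²), where k_in is the number of links from i to c and Σ_tot is the degree sum of c; and a node moves only if the target community is strictly better than rejoining its own. Phase 2 (graph compression) is not shown, and the function name is hypothetical.

```python
def louvain_phase1(adj):
    """Move nodes between neighboring communities until no move raises modularity."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2.0
    community = {v: v for v in adj}                  # one community per node
    total_degree = {v: len(adj[v]) for v in adj}     # degree sum per community

    def gain(k_in, k_i, c):
        # Modularity gain of inserting an isolated node into community c
        return k_in / m - total_degree[c] * k_i / (2.0 * m * m)

    improved = True
    while improved:
        improved = False
        for i in adj:
            k_i = len(adj[i])
            old = community[i]
            total_degree[old] -= k_i                 # take i out of its community
            links = {}                               # links from i to each community
            for j in adj[i]:
                links[community[j]] = links.get(community[j], 0) + 1
            best, best_gain = old, gain(links.get(old, 0), k_i, old)
            for c, k_in in links.items():
                g = gain(k_in, k_i, c)
                if g > best_gain:                    # strictly better than staying
                    best, best_gain = c, g
            community[i] = best
            total_degree[best] += k_i
            if best != old:
                improved = True
    return community
```

The returned dict maps each node to a community label; labels are the original node ids of whichever node seeded the community.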
36
State-of-the-art methods
Evaluated by Lancichinetti and Fortunato, Physical Review E 09:
- Infomap [Rosvall and Bergstrom, PNAS 07]
- Blondel’s method [Blondel et al., J. of Statistical Mechanics: Theory and Experiment 08]
- Ronhovde & Nussinov’s method (RN) [Phys. Rev. E 09]
Many other recent heuristics: OSLOM, QCA…
No provable performance guarantee
Need approximation algorithms
37
Power-Law Networks
38
PLNs Model P(α, β)
39
LDF Algorithm – The Basis
40
LDF Algorithm
41
An Example of LDF
42
Theorem: Sketch of the proof
43
LDF Undirected – Theorem
44
D-LDF – Directed Networks
45
D-LDF – Directed Networks
46
LDF – Directed Networks
47
Dynamic Community Structure
[Figure: network evolution over time steps t, t+1, t+2; labels: move, more edges, merge]
48
Quantifying social group evolution (Palla et al., Nature 07)
- Developed an algorithm based on clique percolation, which makes it possible to investigate the time dependence of overlapping communities
- Uncover basic relationships characterizing community evolution
- Understand the development and self-optimization
49
Findings
- Fundamental differences between the dynamics of small and large groups
- Large groups persist longer and are capable of dynamically altering their membership
- Small groups: their composition must remain unchanged for them to be stable
- Knowledge of the time commitment of members to a given community can be used for estimating the community’s lifetime
53
Research Problems
- How to update the evolving community structure (CS) without recomputing it. Why?
  - Recomputing has prohibitive computational costs.
  - Recomputing can introduce incorrect evolution phenomena.
- How to predict new relationships based on the evolution of the CS.
54
An Adaptive Model
[Diagram labels: input network, network changes, basic communities, basic CS, updated communities]
Need to handle:
- Node insertion
- Edge insertion
- Node removal
- Edge removal
55
Related Work in Dynamic Networks
- GraphScope [J. Sun et al., KDD 2007]
- FacetNet [Y-R. Lin et al., WWW 2008]
- Bayesian inference approach [T. Yang et al., J. Machine Learning, 2010]
- QCA [N. P. Nguyen and M. T. Thai, INFOCOM 2011]
- OSLOM [A. Lancichinetti et al., PLoS ONE, 2011]
- AFOCS [Nguyen et al., MobiCom 2011]
56
An Adaptive Algorithm for Overlapping
Our solution: AFOCS, a 2-phase and limited input dependent framework (N. Nguyen and M. T. Thai, ACM MobiCom 2011)
- Phase 1: Basic CS detection (input network → basic communities)
- Phase 2: Adaptive CS update (network changes → updated communities)
57
Phase 1: Basic Communities Detection
Basic communities:
- Dense parts of the network
- Can possibly overlap
- Bases for adaptive CS update
Duties:
- Locates basic communities
- Merges them if they are highly overlapped
58
Phase 1: Basic Communities Detection
Locating basic communities: when [condition on C not recovered]; example values shown: 0.9 and 0.725
Merging: when [condition on OS(C_i, C_j) not recovered]; example values shown: OS(C_i, C_j) = 1.027 and 0.75
59
Phase 1: Basic Communities Detection
60
Phase 2: Adaptive CS Update
Update network communities when changes are introduced.
[Diagram labels: network changes, basic communities, updated communities]
Need to handle:
- Adding a node/edge
- Removing a node/edge
+ Locally locate new local communities
+ Merge them if they highly overlap with current ones
61
Phase 2: Adding a New Node
62
Phase 2: Adding a New Edge
63
Phase 2: Removing a Node
- Identify the left-over structure(s) on C\{u}
- Merge overlapping substructure(s)
64
Phase 2: Removing an Edge
- Identify the left-over structure(s) on C\{u,v}
- Merge overlapping substructure(s)
65
AFOCS performance: Choosing β
66
AFOCS vs. Static Detection
+ CFinder [G. Palla et al., Nature 2005]
+ COPRA [S. Gregory, New J. of Physics, 2010]
67
AFOCS vs. Other Dynamic Methods
+ iLCD [R. Cazabet et al., SOCIALCOM 2010]
68
Adaptive CS Detection in Dynamic Networks
- Running time is proportional to the amount of changes
- Can be locally employed
- More consistent community structure: critical for applications such as routing
[Flowchart labels: START; 1. Initial Network; 3. Refine CS; 4. Changes in the Network; 5. Output CS; 6. Compact Representation Graph (CRG)]
69
Adaptive CS Detection in Dynamic Networks
[Figure: example weighted graphs on nodes a, b, t, x, y, z]
70
A-LDF – Dynamic Network Algorithm
[Flowchart labels: START; Initial Network; Refine CS; Changes in the Network; Output CS; Compact Representation Graph]
Both selected as the LDF algorithm (without the refining phase).
Compact representation: label nodes that represent communities with the leader. Unlabel all pulled-out nodes (nodes that are incident to changes).
71
A-LDF – Dynamic Network Algorithm
72
Experimental Results
Datasets:
- Static data sets: Karate Club, Dolphin, Twitter, Flickr, etc.
- Dynamic social networks:
  - Facebook (New Orleans): 63 K nodes, 1.5 M edges
  - ArXiv citation network: 225 K articles, ~40 K new each year
73
Static Networks: Size
#    Network               Vertices   Edges
1    Karate                34         78
2    Dolphin               62         159
3    Les Miserables        77         254
4    Political Books       105        441
5    Ame. Col. Fb.         115        613
6    Elec. Cir. S838       512        819
7    Erdos Scie. Collab.   6,100      9,939
8    Foursquare            44,832     1,664,402
9    Facebook              63,731     905,565
10   Twitter               88,484     2,364,322
11   Flickr                80,513     5,899,882
74
Performance Evaluation
75
Evaluation in Dynamic Networks
76
Evaluation in Dynamic Networks
77
Incorporate other information
Social connections:
- Friendship (mutual) relation (Facebook, Google+)
- Follower (unidirectional) relation (Twitter)
78
Incorporate other information
The discussed topics:
- Topics that people in a group are mostly interested in
79
Incorporate other information
Social interaction types:
- Wall posts, private or group messages (Facebook)
- Tweets, retweets (Twitter)
- Comments
80
In rich-content social networks
It is not only the topology that matters, but also:
- User interests: a user may be interested in many communities
- Community interests: a community may be interested in different topics
81
In rich-content social networks
Communities = “groups of users who are interconnected and communicate on shared topics”, interconnected by social connection and interaction types.
Given a social network with:
- Network topology
- Users and their social connections and interactions
- Topics of interest
How can we find meaningful communities as well as their topics of interest?
82
Approaches
Use Bayesian models to extract latent communities:
- Topic User Community Model: posts/tweets can be broadcast
- Topic User Recipient Community Model: posts/tweets are restricted to desired users only
- Full Topic User Recipient Community Model: a post/tweet generated by a user can be based on multiple topics
83
Assumptions
- A user can belong to multiple communities
- A community can participate in multiple topics
- For TUCM and TURCM: posts in general discuss one topic only
- For Full TURCM: posts can discuss multiple topics
84
Background
Multinomial distribution – Mult(.)
- n trials
- k possible outcomes with probabilities $p_1, p_2, \ldots, p_k$ that sum to 1
- $X_1, X_2, \ldots, X_k$, where $X_i$ denotes the number of times outcome #i appears in the n trials
85
Multinomial distribution
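For reference, the probability mass function of the multinomial distribution described above is P(X_1 = x_1, ..., X_k = x_k) = n! / (x_1! ... x_k!) · p_1^{x_1} ... p_k^{x_k}, with n = x_1 + ... + x_k. A minimal sketch of evaluating it follows; the function name is hypothetical and the inputs are assumed to be non-negative integer counts and probabilities that sum to 1.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_k = x_k) for n = sum(counts) independent trials."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)     # exact: each partial quotient is an integer
    return coefficient * prod(p ** x for p, x in zip(probs, counts))
```

For example, with two equally likely outcomes and two trials, observing each outcome once has probability 2 · 0.5 · 0.5 = 0.5.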
86
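Symmetric Dirichlet Distribution
$\mathrm{Dir}_K(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_K)$, on variables $x_1, x_2, \ldots, x_K$ with $x_K = 1 - (x_1 + \cdots + x_{K-1})$, has probability density
$f(x_1, \ldots, x_K; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$
In the symmetric case, all $\alpha_i$ are equal.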
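The Dirichlet density, Γ(Σ α_i) / Π Γ(α_i) · Π x_i^(α_i − 1), can be evaluated numerically as in the sketch below. It works in log space with `math.lgamma` to avoid overflow for large parameters, and assumes that every x_i is strictly positive and that the x_i sum to 1; the function name is hypothetical.

```python
from math import exp, lgamma, log


def dirichlet_pdf(x, alpha):
    """Dirichlet density at a point x in the interior of the simplex."""
    # log of the normalizing constant Gamma(sum a_i) / prod Gamma(a_i)
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    # log of prod x_i^(a_i - 1)
    log_kernel = sum((a - 1.0) * log(xi) for a, xi in zip(alpha, x))
    return exp(log_norm + log_kernel)
```

With α = (1, 1, 1) the density is uniform over the simplex and equals Γ(3) = 2 everywhere.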
87
Notations
- Observation variables
- Latent variables
88
Notations (cont’d)
89
Topic User Community Model
Social Interaction Profile – SIP($u_i$)
- The SIP of users is represented as random mixtures over latent community variables.
- Each community is in turn defined as a distribution over the interaction space.
90
Topic User Community Model
[Figure: generative steps 1 and 2]
91
Topic User Community Model
[Figure: generative steps 3a and 3b]
92
TUCM Model Presentation
A Bayesian decomposition
93
TUCM – Parameter Estimation
94
TUCM – Parameter Estimation
95
TUCM – Parameter Estimation
96
Topic User Recipient Community
This model:
- Does not allow mass messaging
- The sender typically sends messages to his/her acquaintances
- The posts are on a topic that both sender and recipient are interested in
In the same spirit as TUCM. Now we have user $u_j$ for all $u_j$ in $R_i$.
97
TURC
98
Full TURC Model
- Previous models assume that each post generated by a user is based on a single topic.
- Full TURC relaxes this requirement.
- Communities now have a higher relationship to authors.
99
Full TURC Model
[Figure: generative steps 1, 2 and 3]
100
Full TURC Model
101
Experiments
Data:
- 6 months of Twitter in 2009: 5405 nodes, 13214 edges, 23043 posts
- Enron email: 150 nodes, ~300K emails in total
Number of communities C = 10
Number of topics = 20
Competitor methods: CUT and CART
102
Results
103
Results
104
Results