Community Structures. My T. Thai. Presentation transcript:

1 Community Structures

2 What is Community Structure? (My T. Thai, mythai@cise.ufl.edu)
- Definition: a community is a group of nodes with more edges (interactions) between nodes within the group than to nodes outside of it
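As a rough illustration of this definition, the following Python sketch counts the edges inside a candidate group and the edges leaving it. The function name `internal_external_edges` and the toy graph are illustrative assumptions, not part of the slides.

```python
def internal_external_edges(edges, group):
    """Count edges with both endpoints in `group` and edges with exactly one."""
    group = set(group)
    internal = external = 0
    for u, v in edges:
        inside = (u in group) + (v in group)
        if inside == 2:
            internal += 1
        elif inside == 1:
            external += 1
    return internal, external


# Toy graph (assumed for illustration): two triangles joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
internal, external = internal_external_edges(edges, {1, 2, 3})
print(internal, external)
```

A group such as {1, 2, 3} fits the informal definition because its internal edge count exceeds its external edge count.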

3 Why Community Structure (CS)?
- Many systems can be expressed as a network in which nodes represent objects and edges represent the relations between them:
  - Social networks: collaboration networks, online social networks
  - Technological networks: IP address networks, the WWW, software dependency networks
  - Biological networks: protein interaction networks, metabolic networks, gene regulatory networks

4 Why CS? [figure: yeast protein interaction network]

5 Why CS? [figure: IP address network]

6 Why Community Structure?
- Nodes in a community share some common properties
- Communities represent some properties of a network
- Examples:
  - In social networks, communities represent social groupings based on interest or background
  - In citation networks, they represent related papers on one topic
  - In metabolic networks, they represent cycles and other functional groupings

7 An Overview of Recent Work
- Disjoint CS
- Overlapping CS
- Centralized approach
  - Define the quantity of modularity and use greedy algorithms, IP, SDP, spectral methods, random walks, clique percolation
- Localized approach
- Handle dynamics and evolution
- Incorporate other information

8 Graph Partitioning? It’s Not
- Graph partitioning algorithms are typically based on minimum cut approaches or spectral partitioning

9 Graph Partitioning
- Minimum cut partitioning breaks down when we don’t know the sizes of the groups
  - Optimizing the cut size with the group sizes left free puts all vertices in the same group
- Cut size is the wrong thing to optimize
  - A good division into communities is not just one with a small number of edges between groups
  - There must be a smaller than expected number of edges between communities

10 Edge Betweenness
- Focus on the edges that are least central to communities, i.e., the edges that are most “between” communities
- Instead of adding edges to an initially empty graph G = (V, ∅), progressively remove edges from the original graph G = (V, E)

11 Edge Betweenness
- Definition: for each edge (u, v), the edge betweenness of (u, v) is the number of shortest paths between any pair of nodes in the network that run through (u, v)
- $\text{betweenness}(u,v) = |\{ P_{xy} : x, y \in V,\ P_{xy} \text{ is a shortest path between } x \text{ and } y,\ (u,v) \in P_{xy} \}|$
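The definition can be checked directly on a small graph. This is a minimal sketch, assuming an undirected, unweighted graph stored as an adjacency dict; the helper names `bfs_counts` and `edge_betweenness` and the toy graph are hypothetical. It counts every shortest path once, literally as in the definition, which is simple but far slower than the algorithms used in practice.

```python
from collections import deque


def bfs_counts(adj, s):
    """BFS from s: distances and number of shortest paths to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def edge_betweenness(adj, u, v):
    """Number of shortest paths, over all unordered node pairs, that use edge (u, v)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    nodes = list(adj)
    total = 0
    for i, x in enumerate(nodes):
        dist_x, sigma_x = info[x]
        for y in nodes[i + 1:]:
            if y not in dist_x:
                continue  # x and y are disconnected
            dist_y, sigma_y = info[y]
            # The edge may be crossed as u->v or v->u on a path from x to y.
            for a, b in ((u, v), (v, u)):
                if a in dist_x and b in dist_y and dist_x[a] + 1 + dist_y[b] == dist_x[y]:
                    total += sigma_x[a] * sigma_y[b]
    return total


# Toy graph (assumed): two triangles joined by the bridge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(edge_betweenness(adj, 3, 4), edge_betweenness(adj, 1, 2))
```

In this toy graph the bridge carries every shortest path between the two triangles, so it has the largest betweenness.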

12 Why Edge Betweenness

13 Algorithm
- Initialize G = (V, E) representing the network
- While E is not empty:
  - Calculate the betweenness of all edges in G
  - Remove the edge e with the highest betweenness: G = (V, E − {e})
- In fact, after each removal we only need to recalculate the betweenness of the edges affected by that removal
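The loop above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the author's code: the graph is undirected and unweighted, betweenness is computed with a Brandes-style breadth-first accumulation (a pair with several shortest paths splits its contribution fractionally, which differs slightly from a plain path count), and all values are recomputed after every removal rather than only the affected ones. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an undirected, unweighted graph."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):  # accumulate dependencies from the leaves up
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2.0 for e, b in bet.items()}  # each pair was seen from both ends


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def remove_by_betweenness(adj):
    """Remove highest-betweenness edges until none remain; record each split."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    hierarchy = [components(adj)]
    while any(adj.values()):
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > len(hierarchy[-1]):
            hierarchy.append(comps)
    return hierarchy


# Toy graph (assumed): two triangles joined by the bridge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
hierarchy = remove_by_betweenness(adj)
print(hierarchy[1])
```

The returned list is the hierarchy of partitions: the first entry is the connected graph, each later entry has one more component, and the last entry has every node on its own.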

14 Time Complexity
- Let |V| = n and |E| = m
- Calculating the betweenness of all edges takes $O(mn)$
- Since we recalculate after each of the up to m edge removals, the total is $O(m^2 n)$

15 An Example

16 Disadvantages/Improvements
- Can we improve the time complexity?
- The communities come out in hierarchical form; can we find disjoint communities?

17 Define the quantity (measurement) of modularity Q and find an approximation algorithm to maximize Q

18 Finding community structure in very large networks
Authors: Aaron Clauset, M. E. J. Newman, Cristopher Moore (2004)
- Consider edges that fall within a community or between a community and the rest of the network
- Define modularity: $Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w)$, where m is the number of edges and $k_v$ is the degree of vertex v
  - $A_{vw}$: adjacency matrix
  - $\frac{k_v k_w}{2m}$: the probability of an edge between two vertices is proportional to their degrees
  - $\delta(c_v, c_w)$: 1 if the vertices are in the same community, 0 otherwise
- For a random network, Q = 0: the number of edges within a community is no different from what you would expect
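To make the modularity quantity concrete, here is a minimal Python sketch that evaluates Q for a given assignment of nodes to communities on an unweighted, undirected edge list. The function name `modularity`, the data layout, and the toy graph are assumptions made for illustration.

```python
def modularity(edges, community):
    """Q = (1/2m) * sum_vw [A_vw - k_v k_w / 2m] * delta(c_v, c_w)."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Sum of A_vw over ordered pairs in the same community:
    # every intra-community edge is counted twice.
    intra = sum(2 for u, v in edges if community[u] == community[v])
    # Sum of k_v k_w / 2m over same-community pairs equals
    # sum over communities of (total degree)^2 / 2m.
    degree_sum = {}
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    expected = sum(d * d for d in degree_sum.values()) / (2 * m)
    return (intra - expected) / (2 * m)


# Toy graph (assumed): two triangles joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_group = {v: "a" for v in range(1, 7)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Putting all nodes in one community gives Q = 0, while splitting the toy graph into its two triangles gives a positive value.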

19 Finding community structure in very large networks
Authors: Aaron Clauset, M. E. J. Newman, Cristopher Moore (2004)
- Algorithm
  - Start with all vertices as isolates
  - Follow a greedy strategy: successively join the clusters that give the greatest increase ΔQ in modularity
  - Stop when the maximum possible ΔQ from joining any two clusters is ≤ 0
- Successfully used to find community structure in a graph with > 400,000 nodes and > 2 million edges
  - Amazon’s “people who bought this also bought that” network
- Alternatives for achieving the optimum Q: simulated annealing rather than greedy search
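The greedy strategy can be sketched in a few lines of Python. This is a deliberately naive version that recomputes modularity from scratch for every candidate merge; it is not the efficient data-structure-based implementation from the paper, and the names `modularity` and `greedy_merge` are hypothetical.

```python
from itertools import combinations


def modularity(edges, community):
    """Modularity Q of an assignment on an unweighted, undirected edge list."""
    m = len(edges)
    intra = sum(2 for u, v in edges if community[u] == community[v])
    degree_sum = {}
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
    expected = sum(d * d for d in degree_sum.values()) / (2 * m)
    return (intra - expected) / (2 * m)


def greedy_merge(edges):
    """Start from singletons; repeatedly apply the merge with the largest positive gain."""
    nodes = sorted({x for e in edges for x in e})
    community = {v: v for v in nodes}
    q = modularity(edges, community)
    while True:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(sorted(set(community.values())), 2):
            trial = {v: (a if c == b else c) for v, c in community.items()}
            gain = modularity(edges, trial) - q
            if gain > best_gain + 1e-12:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge increases Q: stop
            return community, q
        a, b = best_pair
        community = {v: (a if c == b else c) for v, c in community.items()}
        q += best_gain


# Toy graph (assumed): two triangles joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community, q = greedy_merge(edges)
print(community, q)
```

On the toy graph the merges stop once the two triangles have formed, because joining them would lower Q.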

20 Extensions to weighted networks
- Betweenness clustering?
  - Will not work: strong ties will have a disproportionate number of short paths, and those are the ones we want to keep
- Modularity (Analysis of weighted networks, M. E. J. Newman)
- [figure: Reuters news articles; keywords connected by weighted edges]

21 Structural Quality
- Coverage
- Modularity
- Conductance
- Inter-cluster conductance
- Average conductance
There is no single perfect quality function. [Almeida et al. 2011]

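22 Resolution Limit
- $Q = \sum_s \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]$
- $l_s$: # links inside module s
- $L$: # links in the network
- $d_s$: the total degree of the nodes in module s
- $\frac{d_s^2}{4L}$: expected # of links in module s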

23 The Limit of Modularity
- Modularity seems to have some intrinsic scale of order $\sqrt{L}$, which constrains the number and the size of the modules.
- For a given total number of nodes and links we could build many more than $\sqrt{L}$ modules, but the corresponding network would be less “modular”, namely with a value of the modularity lower than the maximum

24 The Resolution Limit
- Since $M_1$ and $M_2$ are constructed modules, we have (inequality not preserved in this transcript)

25 The Resolution Limit (cont)
- Consider the following case:
  - $Q_A$: $M_1$ and $M_2$ are separate modules
  - $Q_B$: $M_1$ and $M_2$ form a single module
- Since both $M_1$ and $M_2$ are modules by construction, we need (condition not preserved). That is, (inequality not preserved)

26 The Resolution Limit (cont)
- Now let’s see how it contradicts the constructed modules $M_1$ and $M_2$. We consider the following two scenarios:
  - Scenario 1: the two modules have a perfect balance between internal and external degree ($a_1 + b_1 = 2$, $a_2 + b_2 = 2$), so they are on the edge between being and not being communities in the weak sense.
  - Scenario 2: the two modules have the smallest possible external degree, which means that there is a single link connecting them to the rest of the network and only one link connecting them to each other ($a_1 = a_2 = b_1 = b_2 = 1/l$).

27 Scenario 1 (cont)
- When (conditions not preserved), the right side of (inequality not preserved) can reach its maximum value (not preserved). In this case, (outcome not preserved) may happen.

28 Scenario 2 (cont)
- $a_1 = a_2 = b_1 = b_2 = 1/l$

29 Schematic Examples (cont)
- For example, p = 5, m = 20
- The maximal modularity of the network corresponds to the partition in which the two smaller cliques are merged
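The claim can be checked numerically with the per-module form of modularity, Q = Σ_s [l_s/L − (d_s/2L)²]. The sketch below is an illustration under an explicit assumption: the transcript gives only p = 5 and m = 20, so the wiring used here (two cliques of m nodes and two cliques of p nodes arranged in a ring, with one link between consecutive cliques) is assumed, and the helper names are hypothetical.

```python
def partition_modularity(modules, total_links):
    """Q = sum over modules of l_s / L - (d_s / 2L)^2; modules holds (l_s, d_s) pairs."""
    L = total_links
    return sum(l / L - (d / (2 * L)) ** 2 for l, d in modules)


def clique_links(n):
    """Number of links in a complete graph on n nodes."""
    return n * (n - 1) // 2


m, p = 20, 5
# Assumed wiring: ring K_m - K_m - K_p - K_p - back to the first K_m,
# i.e. four single links between the cliques.
L = 2 * clique_links(m) + 2 * clique_links(p) + 4

big = (clique_links(m), 2 * clique_links(m) + 2)      # (l_s, d_s) of one large clique
small = (clique_links(p), 2 * clique_links(p) + 2)    # (l_s, d_s) of one small clique
merged_small = (2 * clique_links(p) + 1, 4 * clique_links(p) + 4)  # both small cliques as one module

q_separate = partition_modularity([big, big, small, small], L)
q_merged = partition_modularity([big, big, merged_small], L)
print(q_separate, q_merged)
```

Under this assumed wiring the merged partition scores slightly higher, even though the two small cliques are joined by only one link, which is the resolution-limit effect the slide describes.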

30 Fix the resolution?
- Uncover communities of different sizes

31 Community Detection Algorithms
- Blondel (Louvain method), [Blondel et al. 2008]
  - Fast Modularity Optimization
  - Hierarchical clustering
- Infomap, [Rosvall & Bergstrom 2008]
  - Maps of Random Walks
  - Flow-based and information theoretic
- InfoH (InfoHiermap), [Rosvall & Bergstrom 2011]
  - Multilevel Compression of Random Walks
  - Hierarchical version of Infomap

32 Community Detection Algorithms
- RN, [Ronhovde & Nussinov 2009]
  - Potts Model Community Detection
  - Minimization of the Hamiltonian of a Potts model spin system
- MCL, [Dongen 2000]
  - Markov Clustering
  - Random walks stay longer in dense clusters
- LC, [Ahn et al. 2010]
  - Link Community Detection
  - A community is redefined as a set of closely interrelated edges
  - Overlapping and hierarchical clustering

33 Blondel et al.
- Two phases:
  - Phase 1:
    - Initially, we have n communities (each node is its own community)
    - For each node i, consider each neighbor j of i and evaluate the modularity gain from moving i into the community of j
    - Place node i in the community for which this gain is maximum (and positive)
    - Stop when no further improvement can be achieved
  - Phase 2:
    - Compress each community into a single node, constructing a new graph that represents the community structure after Phase 1
    - Re-apply Phase 1
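Phase 1 can be sketched as a local-moving loop. This is a simplified illustration, not the reference implementation: it recomputes modularity from scratch for each trial move instead of using the incremental gain formula, it visits nodes in sorted order, and it omits Phase 2 (the graph compression step). The names `modularity` and `local_moving` are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of an assignment on an unweighted, undirected edge list."""
    m = len(edges)
    intra = sum(2 for u, v in edges if community[u] == community[v])
    degree_sum = {}
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
    expected = sum(d * d for d in degree_sum.values()) / (2 * m)
    return (intra - expected) / (2 * m)


def local_moving(edges):
    """Phase 1 only: move each node to the neighboring community with the best positive gain."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    community = {v: v for v in adj}  # every node starts alone
    improved = True
    while improved:
        improved = False
        for i in sorted(adj):
            current = community[i]
            best_c, best_q = current, modularity(edges, community)
            for c in sorted({community[j] for j in adj[i]} - {current}):
                community[i] = c  # trial move
                q = modularity(edges, community)
                if q > best_q + 1e-12:
                    best_c, best_q = c, q
            community[i] = best_c  # keep the best option (possibly staying put)
            if best_c != current:
                improved = True
    return community


# Toy graph (assumed): two triangles joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community = local_moving(edges)
print(community)
```

On the toy graph the loop converges to the two triangles; a full implementation would then compress each community into a node and repeat.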

36 State-of-the-art methods
- Evaluated by Lancichinetti and Fortunato, Physical Review E 09
  - Infomap [Rosvall and Bergstrom, PNAS 07]
  - Blondel’s method [Blondel et al., J. of Statistical Mechanics: Theory and Experiment 08]
  - Ronhovde & Nussinov’s method (RN) [Phys. Rev. E, 09]
- Many other recent heuristics
  - OSLOM, QCA, …
- No provable performance guarantee: need approximation algorithms

37 Power-Law Networks

38 PLNs Model P(α, β)

39 LDF Algorithm: The Basis

40 LDF Algorithm

41 An Example of LDF

42 Theorem: Sketch of the proof

43 LDF Undirected: Theorem

44 D-LDF: Directed Networks

45 D-LDF: Directed Networks

46 LDF: Directed Networks

47 Dynamic Community Structure [figure: network evolution over time steps t, t+1, t+2; labels: move, more edges, merge]

48 Quantifying social group evolution (Palla et al., Nature 07)
- Developed an algorithm based on clique percolation, which makes it possible to investigate the time dependence of overlapping communities
- Uncover basic relationships characterizing community evolution
- Understand the development and self-optimization

49 Findings
- Fundamental differences between the dynamics of small and large groups
  - Large groups persist for longer and are capable of dynamically altering their membership
  - Small groups: their composition remains unchanged in order to be stable
- Knowledge of the time commitment of members to a given community can be used for estimating the community’s lifetime

53 Research Problems
- How to update the evolving community structure (CS) without re-computing it
  - Why?
    - Re-computing has prohibitive computational costs
    - Re-computing can introduce incorrect evolution phenomena
- How to predict new relationships based on the evolution of the CS

54 An Adaptive Model
- [diagram labels: input network, network changes, basic communities, basic CS, updated communities]
- Need to handle:
  - Node insertion
  - Edge insertion
  - Node removal
  - Edge removal

55 Related Work in Dynamic Networks
- GraphScope [J. Sun et al., KDD 2007]
- FacetNet [Y-R. Lin et al., WWW 2008]
- Bayesian inference approach [T. Yang et al., J. Machine Learning, 2010]
- QCA [N. P. Nguyen and M. T. Thai, INFOCOM 2011]
- OSLOM [A. Lancichinetti et al., PLoS ONE, 2011]
- AFOCS [Nguyen et al., MobiCom 2011]

56 An Adaptive Algorithm for Overlapping
- Our solution: AFOCS, a 2-phase and limited input dependent framework
  - Phase 1: basic CS detection
  - Phase 2: adaptive CS update
- [diagram labels: input network, network changes, basic communities, updated communities]
N. Nguyen and M. T. Thai, ACM MobiCom 2011

57 Phase 1: Basic Communities Detection
- Basic communities
  - Dense parts of the network
  - Can possibly overlap
  - Bases for adaptive CS update
- Duties
  - Locates basic communities
  - Merges them if they are highly overlapped

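58 Phase 1: Basic Communities Detection
- Locating basic communities: a candidate C qualifies when its measure reaches its threshold (both symbols are garbled in this transcript); in the example the two values are 0.9 and 0.725
- Merging: merge when OS(C_i, C_j) reaches the threshold β; in the example OS(C_i, C_j) = 1.027 and β = 0.75

59 Phase 1: Basic Communities Detection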

60 Phase 2: Adaptive CS Update
- Update network communities when changes are introduced
- Need to handle:
  - Adding a node/edge
  - Removing a node/edge
- Approach:
  - Locally locate new local communities
  - Merge them if they highly overlap with current ones

61 Phase 2: Adding a New Node

62 Phase 2: Adding a New Edge

63 Phase 2: Removing a Node
- Identify the left-over structure(s) on C\{u}
- Merge overlapping substructure(s)

64 Phase 2: Removing an Edge
- Identify the left-over structure(s) on C\{u,v}
- Merge overlapping substructure(s)

65 AFOCS performance: Choosing β

66 AFOCS vs. Static Detection
- CFinder [G. Palla et al., Nature 2005]
- COPRA [S. Gregory, New J. of Physics, 2010]

67 AFOCS vs. Other Dynamic Methods
- iLCD [R. Cazabet et al., SOCIALCOM 2010]

68 Adaptive CS Detection in Dynamic Networks
- Running time is proportional to the amount of changes
- Can be locally employed
- More consistent community structure: critical for applications such as routing
- [flowchart labels: 1. Initial Network (START); 3. Refine CS; 4. Changes in the Network; 5. Output CS; 6. Compact Representation Graph (CRG)]

69 Adaptive CS Detection in Dynamic Networks [figure: example graphs on nodes a, b, t, x, y, z with numeric edge labels]

70 A-LDF: Dynamic Network Algorithm
- [flowchart labels: Initial Network (START), Refine CS, Changes in the Network, Output CS, Compact Representation Graph]
- Both selected as the LDF algorithm (without the refining phase)
- Compact representation: label the nodes that represent communities with their leader; unlabel all pulled-out nodes (nodes that are incident to changes)

71 A-LDF: Dynamic Network Algorithm

72 Experimental Results
- Datasets
  - Static data sets: Karate Club, Dolphin, Twitter, Flickr, etc.
  - Dynamic social networks:
    - Facebook (New Orleans): 63 K nodes, 1.5 M edges
    - ArXiv citation network: 225 K articles, ~40 K new each year

73 Static Networks (size)
 #   Network               Vertices       Edges
 1   Karate                      34          78
 2   Dolphin                     62         159
 3   Les Miserables              77         254
 4   Political Books            105         441
 5   Ame. Col. Fb.              115         613
 6   Elec. Cir. S838            512         819
 7   Erdos Scie. Collab.      6,100       9,939
 8   Foursquare              44,832   1,664,402
 9   Facebook                63,731     905,565
10   Twitter                 88,484   2,364,322
11   Flickr                  80,513   5,899,882

74 Performance Evaluation

75 Evaluation in Dynamic Networks

76 Evaluation in Dynamic Networks

77 Incorporate other information
- Social connections
  - Friendship (mutual) relation (Facebook, Google+)
  - Follower (unidirectional) relation (Twitter)

78 Incorporate other information
- The discussed topics
  - Topics that people in a group are mostly interested in

79 Incorporate other information
- Social interaction types
  - Wall posts, private or group messages (Facebook)
  - Tweets, retweets (Twitter)
  - Comments

80 In rich-content social networks
- It is not only the topology that matters, but also:
  - User interests
    - A user may be interested in many communities
  - Community interests
    - A community may be interested in different topics

81 In rich-content social networks
- Communities = “groups of users who are interconnected and communicate on shared topics”
  - Interconnected by social connections and interaction types
- Given a social network with
  - Network topology
  - Users and their social connections and interactions
  - Topics of interest
- How can we find meaningful communities as well as their topics of interest?

82 Approaches
- Use Bayesian models to extract latent communities
  - Topic User Community Model (TUCM)
    - Posts/tweets can be broadcast
  - Topic User Recipient Community Model (TURCM)
    - Posts/tweets are restricted to desired users only
  - Full Topic User Recipient Community Model (Full TURCM)
    - A post/tweet generated by a user can be based on multiple topics

83 Assumptions
- A user can belong to multiple communities
- A community can participate in multiple topics
- For TUCM and TURCM
  - Posts in general discuss one topic only
- Full TURCM
  - Posts can discuss multiple topics

84 Background
- Multinomial distribution, Mult(.)
  - n trials
  - k possible outcomes with probabilities p_1, p_2, …, p_k that sum to 1
  - X_1, X_2, …, X_k (X_i denotes the number of times outcome #i appears in the n trials)

85 Multinomial distribution
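For reference, the standard multinomial probability mass function is $P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$ with $x_1 + \dots + x_k = n$. The short Python sketch below evaluates it; the function name `multinomial_pmf` and the example numbers are illustrative assumptions, not taken from the slides.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` in n = sum(counts) trials with outcome probabilities `probs`."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)  # stays an integer at every step
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# Example (assumed numbers): 3 trials, 3 outcomes with probabilities 0.5, 0.3, 0.2.
print(multinomial_pmf([2, 1, 0], [0.5, 0.3, 0.2]))
```

With k = 2 this reduces to the binomial distribution.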

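86 Symmetric Dirichlet Distribution
- $\mathrm{Dir}_K(\alpha)$, where $\alpha = (\alpha_1, \dots, \alpha_K)$, on variables $x_1, x_2, \dots, x_K$ with $x_K = 1 - (x_1 + \dots + x_{K-1})$, has probability density
- $p(x_1, \dots, x_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$
- In the symmetric case all $\alpha_i$ are equal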

87 Notations
- Observation variables
- Latent variables

88 Notations (cont’d)

89 Topic User Community Model
- Social Interaction Profile, SIP(u_i)
- The SIP of each user is represented as a random mixture over latent community variables
- Each community is in turn defined as a distribution over the interaction space

90 Topic User Community Model

91 Topic User Community Model

92 TUCM
- Model presentation
- A Bayesian decomposition

93 TUCM: Parameter Estimation

94 TUCM: Parameter Estimation

95 TUCM: Parameter Estimation

96 Topic User Recipient Community
- This model
  - Does not allow mass messaging
  - The sender typically sends messages to his/her acquaintances
  - The posts are on a topic that both sender and recipient are interested in
- In the same spirit as TUCM
  - Now we have user u_j for all u_j in R_i

97 TURC

98 Full TURC Model
- Previous models
  - Assume that each post generated by a user is based on a single topic
- Full TURC
  - Relaxes this requirement
  - Communities now have a higher relationship to authors

99 Full TURC Model

100 Full TURC Model

101 Experiments
- Data
  - 6 months of Twitter in 2009
    - 5405 nodes, 13214 edges, 23043 posts
  - Enron email
    - 150 nodes, ~300K emails in total
- Number of communities C = 10
- Number of topics = 20
- Competitor methods: CUT and CART

102 Results

103 Results

104 Results

