3.3 Network-Centric Community Detection

Presentation transcript:

3.3 Network-Centric Community Detection A Unified Process

3.3 Network-Centric Community Detection: Comparison
Spectral clustering essentially tries to minimize the number of edges between groups. Modularity maximization instead considers whether the number of edges between groups is smaller than expected. Spectral partitioning is forced to split the network into approximately equal-size clusters.

3.4 Hierarchy-Centric Community Detection
Hierarchy-centric methods build a hierarchical structure of communities based on network topology. There are two types of hierarchical clustering: divisive and agglomerative.
Divisive clustering:
1. Put all objects in one cluster.
2. Repeat until all clusters are singletons:
   a) Choose a cluster to split. (By what criterion?)
   b) Replace the chosen cluster with the sub-clusters. (Split into how many?)

3.4 Hierarchy-Centric Community Detection: Divisive Clustering
One method: cut the "weakest" tie. At each iteration, find the weakest edge; such an edge is most likely to be a tie connecting two communities. Remove that edge. Once the network is decomposed into two connected components, each component is considered a community. Then update the strength of the remaining links. This iterative process is applied to each community in turn to find sub-communities.

3.4 Hierarchy-Centric Community Detection: Divisive Clustering
M. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, 2004, find the weak ties based on edge betweenness.
Edge betweenness is the number of shortest paths between pairs of nodes that pass along the edge. It is used to find the "weakest" tie for hierarchical clustering:
$$C_B(e(v_i, v_j)) = \begin{cases} \sum_{v_s, v_t \in V,\ s<t} \dfrac{\sigma_{st}(e(v_i, v_j))}{\sigma_{st}} & \text{if } i<j \\ 0 & \text{if } i=j \\ C_B(e(v_j, v_i)) & \text{if } i>j \end{cases}$$
where $\sigma_{st}$ is the total number of shortest paths between nodes $v_s$ and $v_t$, and $\sigma_{st}(e(v_i, v_j))$ is the number of shortest paths between nodes $v_s$ and $v_t$ that pass along the edge $e(v_i, v_j)$.

3.4 Hierarchy-Centric Community Detection: Divisive Clustering
An edge with higher betweenness tends to be a bridge between two communities. The algorithm therefore progressively removes the edges with the highest betweenness.
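The divisive procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming an unweighted, undirected graph without self-loops stored as a dictionary of neighbour sets. The function names (`edge_betweenness`, `components`, `split_once`) and the toy graph of two triangles joined by one bridge are assumptions made for this example and do not come from the slides. The betweenness is computed with a breadth-first, Brandes-style accumulation.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph (dict: node -> set of neighbours)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s: count shortest paths and record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing path counts among predecessor edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair (s, t) was visited from both ends, so halve the totals.
    return {e: b / 2.0 for e, b in bet.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def split_once(adj):
    """Remove highest-betweenness edges until the number of components increases."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)  # recomputed after every removal
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy graph (assumed): triangles {1, 2, 3} and {4, 5, 6} joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(split_once(graph))
```

On this toy graph the bridge 3-4 carries all nine shortest paths between the two triangles, so it is removed first and the two triangles emerge as communities. Applying `split_once` again to each component would continue the hierarchy.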

3.4 Hierarchy-Centric Community Detection: Divisive Clustering
M. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, 2004.
Drawbacks of divisive clustering: the edge-betweenness-based scheme is computationally expensive, because each removal of an edge requires recomputing the betweenness of all edges.

3.4 Hierarchy-Centric Community Detection: Agglomerative Clustering
Agglomerative clustering begins with base (singleton) communities and merges them into larger communities according to a certain criterion. One example criterion is modularity.
Let $e_{ij}$ be the fraction of edges in the network that connect nodes in community $i$ to those in community $j$, and let $a_i = \sum_j e_{ij}$. Then the modularity is
$$Q = \sum_i \left( e_{ii} - a_i^2 \right)$$
Values approaching $Q = 1$ indicate networks with strong community structure. Values for real networks typically fall in the range from 0.3 to 0.7.
Rough intuition: the number of edges within the same community minus the number of edges between different communities.
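As a small worked illustration of the modularity formula, the sketch below builds the matrix of $e_{ij}$ values from an edge list and evaluates $Q$. It assumes the common convention that each edge between two different communities contributes half of its fraction to $e_{ij}$ and half to $e_{ji}$, so that all entries sum to 1. The function name `modularity` and the toy graph are assumptions made for this example.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected edge list and a node -> community map."""
    m = len(edges)
    e = defaultdict(float)
    for u, v in edges:
        cu, cv = community[u], community[v]
        # Each edge end contributes 1/(2m); an intra-community edge adds 1/m to e_ii.
        e[(cu, cv)] += 1.0 / (2 * m)
        e[(cv, cu)] += 1.0 / (2 * m)
    a = defaultdict(float)
    for (ci, _cj), frac in e.items():
        a[ci] += frac  # a_i = sum_j e_ij
    return sum(e[(c, c)] - a[c] ** 2 for c in a)

# Toy graph (assumed): two triangles joined by the single edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {v: "A" for v in range(1, 7)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # about 0: a single community has no structure
```

Here each triangle has $e_{ii} = 3/7$ and $a_i = 1/2$, giving $Q = 2(3/7 - 1/4) = 5/14$.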

3.4 Hierarchy-Centric Community Detection: Agglomerative Clustering
At each step, the two communities whose merge yields the largest increase in overall modularity are merged. Merging continues until no merge can be found that improves the modularity.
[Figure: dendrogram produced by agglomerative clustering based on modularity]
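The merge rule above can be sketched as a brute-force loop. This is an illustrative sketch only: it re-evaluates the modularity of every candidate merge from scratch, which is far slower than the incremental updates used in Newman's fast algorithm. The names `modularity` and `greedy_merge` and the toy graph are assumptions made for this example.

```python
from collections import defaultdict

def modularity(edges, label):
    """Q = sum_i (e_ii - a_i^2) for an undirected edge list and a node -> label map."""
    m = len(edges)
    e, a = defaultdict(float), defaultdict(float)
    for u, v in edges:
        cu, cv = label[u], label[v]
        if cu == cv:
            e[cu] += 1.0 / m          # fraction of edges inside community cu
        a[cu] += 1.0 / (2 * m)        # fraction of edge ends attached to cu
        a[cv] += 1.0 / (2 * m)
    return sum(e[c] - a[c] ** 2 for c in a)

def greedy_merge(nodes, edges):
    """Start from singletons; repeatedly apply the merge with the largest gain in Q."""
    label = {v: v for v in nodes}
    q = modularity(edges, label)
    merges = []
    while True:
        comms = sorted(set(label.values()))
        best = None
        for i, ca in enumerate(comms):
            for cb in comms[i + 1:]:
                trial = {v: (ca if c == cb else c) for v, c in label.items()}
                q_trial = modularity(edges, trial)
                if best is None or q_trial > best[0]:
                    best = (q_trial, ca, cb, trial)
        if best is None or best[0] <= q:
            break  # no merge improves the modularity
        q, label = best[0], best[3]
        merges.append((best[1], best[2], q))
    return label, q, merges

# Toy graph (assumed): two triangles joined by the single edge 3-4.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
label, q, merges = greedy_merge(nodes, edges)
print(label, q)
print(merges)  # the merge history is what a dendrogram would draw
```

On this toy graph the loop stops with the two triangles as communities, because merging them would lower the modularity.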

3.4 Hierarchy-Centric Community Detection: Agglomerative Clustering
In the dendrogram, the circles at the bottom represent the individual nodes of the network. As we move up the tree, the nodes join together to form larger and larger communities, as indicated by the lines, until we reach the top, where all are joined together in a single community. Alternatively, the dendrogram depicts an initially connected network splitting into smaller and smaller communities as we go from top to bottom. A cross section of the tree at any level, such as the one indicated by a dotted line, gives the communities at that level.

3.4 Hierarchy-Centric Community Detection: Divisive vs. Agglomerative Clustering
Zachary's karate club study: Zachary observed 34 members of a karate club over a period of two years. During the course of the study, a disagreement developed between the administrator (34) of the club and the club's instructor (1), which ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club's members with him.

3.4 Hierarchy-Centric Community Detection: Divisive vs. Agglomerative Clustering
Divisive: Michelle Girvan and M. E. J. Newman, "Community structure in social and biological networks," 2001, using edge betweenness.
Agglomerative: M. E. J. Newman, "Fast algorithm for detecting community structure in networks," 2003, using modularity.
[Figure panels: Divisive, Agglomerative]

Summary of Community Detection
Node-Centric Community Detection: cliques, k-cliques, k-clubs
Group-Centric Community Detection: quasi-cliques
Network-Centric Community Detection: clustering based on vertex similarity; latent space models, block models, spectral clustering, modularity maximization
Hierarchy-Centric Community Detection: divisive clustering, agglomerative clustering

3.5 Community Evaluation
Here, we consider a social network with ground truth: the community membership of each actor is known, which is an ideal case. For example:
Synthetic networks generated from predefined community structures. L. Tang and H. Liu. "Graph mining applications to social network analysis." In C. Aggarwal and H. Wang, editors, Managing and Mining Graph Data, chapter 16, pages 487-513. Springer, 2010.
Well-studied tiny networks such as Zachary's karate club with 34 members. M. Newman. "Modularity and community structure in networks." PNAS, 103(23):8577-8582, 2006.
In these cases the identified community structure can be compared directly with the ground truth, by visualization or by a one-to-one mapping.

3.5 Community Evaluation
How do we measure clustering quality? The number of communities after grouping can differ from the ground truth, and there may be no clear correspondence between the communities in the clustering result and those in the ground truth. In such cases, Normalized Mutual Information (NMI) can be used.
In the figure, each number denotes a node, and each circle or block denotes a community: 1) both communities {1, 3} and {2} map to the community {1, 2, 3} in the ground truth; 2) node 2 is wrongly assigned.

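3.5 Community Evaluation: Entropy
Entropy measures the uncertainty of a random variable. It is a measure of disorder: the information volume contained in a random variable X (or in a distribution X).
The entropy of X is the sum, over all possible outcomes x of X, of the probability of x multiplied by the logarithm of the inverse of that probability:
$$H(X) = -\sum_{x \in X} p(x) \log_b p(x)$$
Common choices for the base b are 2, Euler's number e, and 10. With b = 2 the unit of entropy is the bit, with b = e the nat, and with b = 10 the digit.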

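3.5 Community Evaluation: Entropy and Coin Tossing [from Wikipedia]
Consider the entropy of tossing a coin whose heads and tails are equally likely. There are only two outcomes, H and T, so with b = 2 the entropy is 1:
$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = -\left( \tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2} \right) = 1$$
For an unfair coin, one side is relatively more likely to come up, so the entropy is smaller than 1. Because our chance of predicting the outcome correctly is higher, the amount of information, that is the entropy, is smaller. For coin tossing, the entropy is largest for the coin whose heads and tails both have probability 1/2.
Entropy can be regarded as the same concept as uncertainty: the higher the uncertainty, the greater the amount of information and the larger the entropy.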
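The coin example can be checked numerically. The following is a minimal sketch; the function name `entropy` and the biased probabilities 0.9 and 0.1 are assumptions chosen for illustration.

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x p(x) * log_b p(x); outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = entropy([0.5, 0.5])     # fair coin: 1 bit
biased = entropy([0.9, 0.1])   # biased coin: less than 1 bit
print(fair, biased)
```

The biased coin is easier to predict, so its entropy comes out lower than that of the fair coin.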

3.5 Community Evaluation: Mutual Information
Mutual information measures the shared information volume between two random variables (or two distributions) X and Y, that is, how closely related they are or how strongly they depend on each other.
References (in Korean): http://shineware.tistory.com/7 http://www.dbpia.co.kr/Journal/ArticleDetail/339089

3.5 Community Evaluation: Normalized Mutual Information (NMI)
NMI measures the shared information volume between two random variables (or two distributions) X and Y, that is, how closely related they are. Its value lies between 0 and 1. By treating a partition as a random variable, we can compute the matching quality between the ground truth and the identified clustering.

3.5 Community Evaluation: NMI Example (1/2)
Partition a ($\pi^a$): [1, 1, 1, 2, 2, 2], i.e. communities {1, 2, 3} and {4, 5, 6}
Partition b ($\pi^b$): [1, 2, 1, 3, 3, 3], i.e. communities {1, 3}, {2}, and {4, 5, 6}

3.5 Community Evaluation: NMI Example (2/2)
Partition a ($\pi^a$): [1, 1, 1, 2, 2, 2], i.e. communities {1, 2, 3} and {4, 5, 6}
Partition b ($\pi^b$): [1, 2, 1, 3, 3, 3], i.e. communities {1, 3}, {2}, and {4, 5, 6}
Result: NMI = 0.8278
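The value 0.8278 can be reproduced with a short script. The transcript does not preserve the NMI formula itself, so this sketch assumes the normalization that divides the mutual information by the geometric mean of the two entropies, $\sqrt{H(\pi^a) H(\pi^b)}$, which matches the number on the slide. The function name `nmi` is an assumption made for this example.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(a) * H(b)) (assumed normalization)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        c / n * math.log(n * c / (count_a[h] * count_b[l]))
        for (h, l), c in joint.items()
    )
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    # Undefined when either partition has a single community (zero entropy).
    return mi / math.sqrt(h_a * h_b)

partition_a = [1, 1, 1, 2, 2, 2]
partition_b = [1, 2, 1, 3, 3, 3]
print(round(nmi(partition_a, partition_b), 4))  # 0.8278
```

The logarithm base cancels in the ratio, so natural logarithms are used. Other normalizations exist (for example dividing by the arithmetic mean of the entropies) and give different values for the same partitions.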

3.5 Community Evaluation: Accuracy of Pairwise Community Memberships
Consider all possible pairs of nodes and check whether they reside in the same community. An error occurs if:
two nodes belonging to the same community are assigned to different communities after clustering, or
two nodes belonging to different communities are assigned to the same community.
From these pair counts, construct a contingency table.

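3.5 Community Evaluation: Accuracy of Pairwise Community Memberships
Ground truth: {1, 2, 3}, {4, 5, 6}
Clustering result: {1, 3}, {2}, {4, 5, 6}
Of the 15 node pairs, 4 are in the same community in both partitions, 2 share a community in the ground truth but are split by the clustering, 0 are separated in the ground truth but merged by the clustering, and 9 are separated in both.
Accuracy = (4 + 9) / (4 + 2 + 9 + 0) = 13/15 ≈ 0.87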
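The pair counts in this example can be verified by enumerating all node pairs. This is a minimal sketch; the function name `pair_counts` and the variable names a, b, c, d for the four cells of the contingency table are assumptions chosen to match the balanced accuracy formula on the following slides.

```python
from itertools import combinations

def pair_counts(truth, predicted):
    """Count node pairs by agreement between ground truth and clustering result.

    a: same community in both        b: different in truth, same in clustering
    c: same in truth, split by clustering        d: different in both
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            a += 1
        elif same_pred:
            b += 1
        elif same_truth:
            c += 1
        else:
            d += 1
    return a, b, c, d

truth = [1, 1, 1, 2, 2, 2]       # ground truth: {1, 2, 3}, {4, 5, 6}
predicted = [1, 2, 1, 3, 3, 3]   # clustering result: {1, 3}, {2}, {4, 5, 6}
a, b, c, d = pair_counts(truth, predicted)
accuracy = (a + d) / (a + b + c + d)
print((a, b, c, d), accuracy)
```

Because only pairs are compared, the community labels themselves never need to be matched between the two partitions.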

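3.5 Community Evaluation: Accuracy of Pairwise Community Memberships
Balanced Accuracy (BAC) = 1 - Balanced Error Rate (BER):
$$BAC = \frac{1}{2} \left( \frac{a}{a+c} + \frac{d}{b+d} \right) = 1 - BER$$
$$BER = \frac{1}{2} \left( \frac{c}{a+c} + \frac{b}{b+d} \right)$$
Here a, b, c, and d are the pair counts from the contingency table: a + c is the number of pairs in the same community in the ground truth, and b + d is the number of pairs in different communities in the ground truth.
This measure assigns equal importance to "false positives" and "false negatives", so that trivial or random predictions incur an error of 0.5 on average.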

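3.5 Community Evaluation: Accuracy of Pairwise Community Memberships
Balanced Accuracy (BAC) = 1 - Balanced Error Rate (BER):
$$BAC = \frac{1}{2} \left( \frac{a}{a+c} + \frac{d}{b+d} \right) = 1 - BER, \qquad BER = \frac{1}{2} \left( \frac{c}{a+c} + \frac{b}{b+d} \right)$$
For the example:
$$BAC = \frac{1}{2} \left( \frac{4}{6} + \frac{9}{9} \right) \approx 0.83$$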
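The same arithmetic in code, as a quick check. The pair counts a = 4, b = 0, c = 2, d = 9 are read off the fractions 4/6 and 9/9 in the example; the function names are assumptions made for this sketch.

```python
def balanced_accuracy(a, b, c, d):
    """BAC = 1/2 * (a / (a + c) + d / (b + d))."""
    return 0.5 * (a / (a + c) + d / (b + d))

def balanced_error_rate(a, b, c, d):
    """BER = 1/2 * (c / (a + c) + b / (b + d))."""
    return 0.5 * (c / (a + c) + b / (b + d))

bac = balanced_accuracy(4, 0, 2, 9)
ber = balanced_error_rate(4, 0, 2, 9)
print(round(bac, 2), round(ber, 2))  # 0.83 0.17
```

Both functions divide by a + c and b + d, so they are undefined when the ground truth puts all nodes in one community or every node in its own community.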

3.5 Community Evaluation: Evaluation without Ground Truth
This is the most common situation. A quantitative evaluation function is needed, such as modularity: once we have a network partition, we can compute its modularity, and the method with the higher modularity wins.
Let $e_{ij}$ be the fraction of edges in the network that connect nodes in community $i$ to those in community $j$, and let $a_i = \sum_j e_{ij}$. Then the modularity is
$$Q = \sum_i \left( e_{ii} - a_i^2 \right)$$
Values approaching $Q = 1$ indicate networks with strong community structure. Values for real networks typically fall in the range from 0.3 to 0.7.
Rough intuition: the number of edges within the same community minus the number of edges between different communities.

The book is available from Morgan & Claypool Publishers and Amazon. If you have any comments, please feel free to contact: Lei Tang, Yahoo! Labs, ltang@yahoo-inc.com; Huan Liu, ASU, huanliu@asu.edu