SOCIAL NETWORKS and COMMUNITY DETECTION. “Networks” is a pervasive term? Networked Economy Immigrant Networks National Innovation Networks Networking.

Presentation transcript:

SOCIAL NETWORKS and COMMUNITY DETECTION

“Networks” is a pervasive term: Networked Economy, Immigrant Networks, National Innovation Networks, Networking, Entrepreneurial Networks, Ego Networks, Regional Networks, Infrastructure Networks, Social Networks.

What is Social Network Analysis? Network analysis is a keyword for the 21st century: researchers, politicians, and ordinary people all talk about social networks.

What is a Network? (Web definition) A set of nodes, points, or locations connected by means of data, voice, and video communications for the purpose of exchange.

Real-World Networks: the Internet, the World Wide Web, citation networks, transportation networks, food webs, social networks, biochemical networks.

Social Networks: A social network is a description of the social structure between actors, mostly individuals or organizations. It indicates the ways in which they are connected through various social familiarities, ranging from casual acquaintance to close familial bonds.

Marriage ties among Renaissance Florentine families A paleo-social network

Social Network Analysis Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities. It also includes community detection. The nodes in the network are the people and groups while the links (ties) show relationships or flows between the nodes. Community detection is discovering groups in the network where individuals’ group memberships are not explicitly given

Social Network Data: The units of interest in a network are the combined sets of actors and their relations. We represent actors with points and relations with lines. Actors are referred to variously as nodes, vertices or points. Relations are referred to variously as edges, arcs, lines or ties. [Figure: example graph on the five nodes a, b, c, d, e.]

SN = graph: A network can be represented as a graph data structure, and we can apply a variety of measures and analyses to the graph representing a given SN. Ties in a SN can be directed or undirected (e.g. friendship and co-authorship are usually undirected, while other relations are directed).

From graphs to matrices (Social Network: Basic Data Structures). [Figure: the five-node graph a, b, c, d, e drawn once as an undirected binary (0,1) graph and once as a directed binary graph, each next to its 5 x 5 adjacency matrix; the matrix entries are not preserved.]

From matrices to lists (Social Network: Basic Data Structures)
Adjacency list: a: b; b: a, c; c: b, d, e; d: c, e; e: c, d
Arc list: a b; b a; b c; c b; c d; c e; d c; d e; e c; e d

Social Network as a graph: In general, a relation can be binary or valued, and directed or undirected. [Figure: the five-node graph drawn four ways: undirected binary; directed binary; undirected valued; directed valued.]

Social Network: Measuring
– the flow of information: Topology (Connectivity, Centrality) and Time
– Structure & Social Space
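The conversion from a matrix to the two list forms can be sketched in a few lines of Python. The 0/1 matrix below is an assumption: it is rebuilt from the adjacency list on this slide (ties a-b, b-c, c-d, c-e, d-e), since the matrix itself is not preserved, and the function names are illustrative only.

```python
NODES = ["a", "b", "c", "d", "e"]

# Assumed undirected, binary adjacency matrix (rows and columns in NODES order).
MATRIX = [
    [0, 1, 0, 0, 0],  # a
    [1, 0, 1, 0, 0],  # b
    [0, 1, 0, 1, 1],  # c
    [0, 0, 1, 0, 1],  # d
    [0, 0, 1, 1, 0],  # e
]

def adjacency_list(nodes, matrix):
    """Map each node to the list of nodes it has a tie to."""
    return {u: [v for v, x in zip(nodes, row) if x] for u, row in zip(nodes, matrix)}

def arc_list(nodes, matrix):
    """List every (sender, receiver) pair with a non-zero matrix entry."""
    return [(u, v) for u, row in zip(nodes, matrix) for v, x in zip(nodes, row) if x]

ADJ = adjacency_list(NODES, MATRIX)
ARCS = arc_list(NODES, MATRIX)
```

An undirected tie appears twice in the arc list (once per direction), which is why five ties give ten arcs.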

Measuring Networks: Flow. In addition to the simple probability that one actor passes information on to another (p_ij), two factors affect flow through a network. Topology: the shape, or form, of the network. Example: one actor cannot pass information to another unless they are either directly or indirectly connected. Time: the timing of contact matters. Example: an actor cannot pass on information he has not yet received.

Two features of the network’s topology are known to be important: connectivity and centrality Connectivity refers to how actors in one part of the network are connected to actors in another part of the network. Reachability: Is it possible for actor i to reach actor j? This can only be true if there is a chain of contact from one actor to another. Distance: Given they can be reached, how many steps are they from each other? Number of paths: How many different paths connect each pair? Measuring Networks: Topology

Measuring Networks: Connectivity. Without full network data, you can’t distinguish actors with limited information potential from those more deeply embedded in a setting. [Figure: three actors a, b, c.]

Measuring Networks: Connectivity, Reachability. Indirect connections are what make networks systems. One actor can reach another if there is a path in the graph connecting them. Paths can be directed, leading to a distinction between “strong” and “weak” components. [Figure: example graph on nodes a to f.]

Basic elements in connectivity (Measuring Networks: Connectivity, Reachability)
A path is a sequence of nodes and edges starting with one node and ending with another, tracing the indirect connection between the two. On a path, you never go backwards or revisit the same node twice. Example: a → b → c → d
A walk is any sequence of nodes and edges, and may go backwards. Example: a → b → c → b → c → d
A cycle is a path that starts and ends with the same node. Example: a → b → c → a

Reachability If you can trace a sequence of relations from one actor to another, then the two are reachable. If there is at least one path connecting every pair of actors in the graph, the graph is connected and is called a component. Intuitively, a component is the set of people who are all connected by a chain of relations. Measuring Networks: Connectivity
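Components can be found by repeated breadth-first search. The following is a minimal sketch, assuming an undirected graph stored as an adjacency dictionary; the six-node graph is a hypothetical example, not one of the networks on the slides.

```python
from collections import deque

def components(adj):
    """Return the connected components of an undirected graph as a list of sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:  # v is reachable from start and not yet recorded
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        result.append(comp)
    return result

# Hypothetical graph: a chain a-b-c, a pair d-e, and an isolate f.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"], "f": []}
COMPONENTS = components(GRAPH)
```

Two actors are reachable from each other exactly when they land in the same set.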

This example contains many components. Reachability Measuring Networks: Connectivity

Measuring Networks: Connectivity, Reachability. In general, components can be directed or undirected. For a graph with any directed edges, there are two types of components: Strong components consist of the set(s) of all nodes that are mutually reachable. Weak components consist of the set(s) of all nodes where at least one node can reach the other.

Measuring Networks: Connectivity, Reachability. There are only 2 strong components with more than 1 person in this network. [Figure not preserved.]

Measuring Networks: Connectivity, Distance & number of paths. Distance is measured by the (weighted) number of relations separating a pair. In the example, actor “a” is 1 step from 4 actors, 2 steps from 5, 3 steps from 4, 4 steps from 3, and 5 steps from 1. [Figure not preserved.]

Measuring Networks: Flow, Distance & number of paths. Paths are the different routes one can take. Node-independent paths are particularly important. In the example there are 2 independent paths connecting a and b, and many non-independent paths. [Figure not preserved.]

Measuring Networks: Connectivity, Distance & number of paths. Probability of transfer by distance and number of paths, assuming a constant p_ij (the value is not preserved in the transcript). [Figure: probability against path distance, with one curve each for 10 paths, 5 paths, 2 paths and 1 path.]
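For unweighted ties, the distance from one actor to every other actor is the breadth-first search depth. The sketch below also tallies how many actors sit at each distance, which is the kind of summary given for actor “a” above. The graph is an assumption: it reuses the small five-node graph from the adjacency-list slide, not the (lost) network of this slide.

```python
from collections import Counter, deque

def distances(adj, source):
    """Number of steps on a shortest path from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Assumed example graph: ties a-b, b-c, c-d, c-e, d-e.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"], "d": ["c", "e"], "e": ["c", "d"]}
DIST = distances(GRAPH, "a")
# How many actors are 1 step, 2 steps, ... away from "a".
STEPS = Counter(d for node, d in DIST.items() if node != "a")
```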
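One simple way to reproduce curves of this kind is sketched below. It rests on assumptions the slide does not state: every tie transmits independently with the same probability p, a single path of length d therefore succeeds with probability p^d, and k node-independent paths of that length fail only if all of them fail. The function name and the value p = 0.5 used in the examples are hypothetical.

```python
def transfer_probability(p, distance, paths=1):
    """Probability that information crosses `distance` steps on at least one of
    `paths` node-independent routes, each tie succeeding with probability p."""
    single = p ** distance               # one route: every tie must succeed
    return 1 - (1 - single) ** paths     # at least one of the routes succeeds

ONE_PATH = transfer_probability(0.5, 2)
TWO_PATHS = transfer_probability(0.5, 2, paths=2)
```

Under these assumptions the probability falls quickly with distance and rises with the number of independent paths.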

Centrality refers to (one dimension of) location, identifying where an actor resides in a network. For example, we can compare actors at the edge of the network to actors at the center. In general, this is a way to formalize intuitive notions about the distinction between insiders and outsiders. Centrality Measuring Networks: Centrality

Measuring Networks: Centrality. Conceptually, centrality is fairly straightforward: we want to identify which nodes are in the ‘center’ of the network. In practice, identifying exactly what we mean by ‘center’ is somewhat complicated. Three standard centrality measures capture a wide range of “importance” in a network: Degree, Closeness, Betweenness.

The most intuitive notion of centrality focuses on degree. Degree is the number of ties, and the actor with the most ties is the most important: Centrality Degree Measuring Networks: Centrality

Measuring Networks: Centrality, Degree. Degree centrality, however, can be deceiving, because it is a purely local measure. In the example, node 3 is intuitively more “central” than, e.g., node 5. [Figure not preserved.]

Measuring Networks: Centrality. If we want to measure the degree to which the graph as a whole is centralized, we look at the dispersion of centrality. Simple: the variance of the individual centrality scores. Or, using Freeman’s general formula for centralization: $C_D = \frac{\sum_{i=1}^{N} [C_D(n^*) - C_D(n_i)]}{(N-1)(N-2)}$, where $C_D(n^*)$ is the maximum degree centrality attained in the network, so we are measuring the dispersion around that value.

Measuring Networks: Centrality. Degree Centralization Scores for five example networks [figures not preserved]:
Freeman: .07, Variance: .20
Freeman: 1.0, Variance: 3.9
Freeman: .02, Variance: .17
Freeman: 0.0, Variance: 0.0
Freeman: 0.1, Variance: 4.84
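Both dispersion measures are easy to compute from an adjacency dictionary. The sketch below assumes the usual form of Freeman’s degree centralization, the sum of (maximum degree minus each degree) divided by (N-1)(N-2), and uses the population variance; the star and ring graphs are hypothetical examples chosen because they give the extreme scores.

```python
def degree_centrality(adj):
    """Degree = number of ties of each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def freeman_degree_centralization(adj):
    degrees = list(degree_centrality(adj).values())
    n = len(degrees)
    top = max(degrees)
    # Sum of gaps to the most central node, scaled by the largest possible sum.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

def degree_variance(adj):
    degrees = list(degree_centrality(adj).values())
    mean = sum(degrees) / len(degrees)
    return sum((d - mean) ** 2 for d in degrees) / len(degrees)

# Hypothetical five-node examples.
STAR = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
RING = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
```

A star is maximally centralized (score 1.0) and a ring, where every node has the same degree, is not centralized at all (score 0.0).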

Measuring Networks: Centrality. A second measure of centrality is closeness centrality. An actor is considered important if he/she is relatively close to all other actors. Closeness is based on the inverse of the distance of each actor to every other actor in the network. Closeness Centrality: $C_C(n_i) = \left[\sum_{j=1}^{N} d(n_i, n_j)\right]^{-1}$. Normalized Closeness Centrality: $C'_C(n_i) = (N-1)\, C_C(n_i)$.
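A minimal sketch of closeness centrality follows, assuming a connected, unweighted, undirected graph and the common definition: the inverse of the summed distances to all other actors, multiplied by N-1 for the normalized score. The five-node star is a hypothetical example.

```python
from collections import deque

def distances(adj, source):
    """Breadth-first search distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, node, normalized=False):
    total = sum(distances(adj, node).values())  # summed distance to every other actor
    return (len(adj) - 1) / total if normalized else 1 / total

# Hypothetical star: node 0 is the hub.
STAR = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
HUB = closeness(STAR, 0, normalized=True)
LEAF = closeness(STAR, 1, normalized=True)
```

The hub is one step from everyone and reaches the maximum normalized score of 1, while each leaf scores 4/7.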

Measuring Networks: Centrality. Closeness Centrality in the examples. [Tables of distance, closeness and normalized closeness for the example networks; the values are not preserved.]

Measuring Networks: Centrality. Graph-Theoretic Center (Barycenter or Jordan Center). Identify the set of all vertices A for which the greatest distance d(A,B) to other vertices B is minimal. The value attached to each node is its longest distance to any other node. In the example, the graph-theoretic center is node ‘3’. [Figure not preserved.]
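The definition translates directly into code: compute each node’s longest shortest-path distance (its eccentricity) and keep the nodes where it is smallest. This is a sketch assuming a connected, unweighted graph; the five-node path is a hypothetical example and is not the network of the slide.

```python
from collections import deque

def distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def graph_center(adj):
    """Return (center nodes, their eccentricity)."""
    ecc = {u: max(distances(adj, u).values()) for u in adj}  # longest distance per node
    best = min(ecc.values())
    return sorted(u for u, e in ecc.items() if e == best), best

# Hypothetical path graph 1-2-3-4-5.
PATH = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
CENTER, RADIUS = graph_center(PATH)
```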

Measuring Networks: Centrality. Betweenness Centrality: a model based on communication flow. A person who lies on communication paths can control communication flow, and is thus important. Betweenness centrality counts the number of geodesic paths between i and k that actor j resides on. Geodesics are defined as the shortest paths between points. [Figure: example graph on nodes a to h.]

Measuring Networks: Centrality. Betweenness Centrality. [Figure: example graph on nodes a to m with its geodesics; not preserved.]

Measuring Networks: Centrality. Betweenness Centrality: $C_B(n_i) = \sum_{j<k} g_{jk}(n_i) / g_{jk}$, where $g_{jk}$ = the number of geodesics connecting j and k, and $g_{jk}(n_i)$ = the number of those geodesics that actor i is on. Usually normalized by: $C'_B(n_i) = C_B(n_i) / [(N-1)(N-2)/2]$.

Measuring Networks: Centrality. Betweenness centralization scores for five example networks [figures not preserved]: 1.0; .31; .59; 0; .183.
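The sum over pairs can be evaluated directly by counting geodesics with breadth-first search. This is a sketch for small undirected graphs, assuming the standard definition (for each pair j, k, add the share of their geodesics that pass through actor i) and the normalizer (N-1)(N-2)/2. It favors clarity over speed, and the star graph is a hypothetical example.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Distances from s and the number of geodesics from s to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u precedes v on a geodesic
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, normalized=False):
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {i: 0.0 for i in adj}
    for j, k in combinations(adj, 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j:
            continue  # j and k are not connected
        for i in adj:
            if i in (j, k) or i not in dist_j:
                continue
            dist_i, sigma_i = info[i]
            # i lies on a j-k geodesic when the two legs add up to d(j, k).
            if k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                score[i] += sigma_j[i] * sigma_i[k] / sigma_j[k]
    if normalized:
        n = len(adj)
        score = {i: s / ((n - 1) * (n - 2) / 2) for i, s in score.items()}
    return score

# Hypothetical star: every pair of leaves communicates through node 0.
STAR = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
RAW = betweenness(STAR)
NORMALIZED = betweenness(STAR, normalized=True)
```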

Information Centrality: It is quite likely that information can flow through paths other than the geodesic. The Information Centrality score uses all paths in the network, and weights them based on their length. Measuring Networks: Centrality

Information Centrality: Measuring Networks: Centrality

( Node size proportional to betweenness centrality ) Actors that appear very different when seen individually, are comparable in the global network. Measuring Networks: Centrality

Two factors that affect network flows: Topology - the shape, or form, of the network - simple example: one actor cannot pass information to another unless they are either directly or indirectly connected Time - the timing of contacts matters - simple example: an actor cannot pass information he has not yet received. Time Measuring Networks: Time

Measuring Networks: Time. Timing in networks: A focus on contact structure has often slighted the importance of network dynamics, though a number of recent works are addressing this. Time affects networks in two important ways: 1) The structure itself evolves, in ways that will affect the topology and thus flow. 2) The timing of contact constrains information flow.

Data on drug users in Colorado Springs, over 5 years Drug Relations, Colorado Springs, Year 1 Measuring Networks: Time

Drug Relations, Colorado Springs, Year 2 Current year in red, past relations in gray Measuring Networks: Time

Drug Relations, Colorado Springs, Year 3 Current year in red, past relations in gray Measuring Networks: Time

Drug Relations, Colorado Springs, Year 4 Current year in red, past relations in gray Measuring Networks: Time

Drug Relations, Colorado Springs, Year 5 Current year in red, past relations in gray Measuring Networks: Time

Measuring Networks: Time. What impact does timing have on flow through the network? [Figure: hypothetical contact network on actors A to F; numbers above the lines indicate contact periods.]

Measuring Networks: Time. The path graph for the hypothetical contact network. While clearly important, this is not often handled well by current SNA software. [Figure: path graph on actors A to F.]

Measuring Networks: Structure & Social Space The second broad division for measuring networks steps back to generalized features of the global network. These factors almost always are of interest because of what they imply about how information moves through the network, but have resulted in a distinct line of methods and substantive research. The study of these generalized features is also known as community detection (small worlds analysis, etc.)

Community
Community: a set of individuals such that those within the group interact with each other more frequently than with those outside the group
– a.k.a. group, cluster, cohesive subgroup, or module in different contexts
Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given
Why communities in social media?
– Human beings are social
– Easy-to-use social media allows people to extend their social life in unprecedented ways
– It is difficult to meet friends in the physical world, but much easier to find friends online with similar interests
– Interactions between nodes can help determine communities

Communities in Social Media
Two types of groups in social media
– Explicit groups: formed by user subscriptions
– Implicit groups: implicitly formed by social interactions
Some social media sites allow people to join groups. Is it still necessary to extract groups based on network topology?
– Not all sites provide a community platform
– Not all people want to make the effort to join groups
– Groups can change dynamically
Network interaction provides rich information about the relationship between users
– Can complement other kinds of information, e.g. user profiles
– Helps network visualization and navigation
– Provides basic information for other tasks, e.g. recommendation

Subjectivity of Community Definition
The definition of a community can be subjective. [Figure: the same network read two ways: each component is a community, or only the densely knit group is a community.]

Taxonomy of Community Criteria
Criteria vary depending on the tasks. Roughly, community detection methods can be divided into 4 categories (not exclusive):
Node-Centric Community
– Each node in a group satisfies certain properties
Group-Centric Community
– Consider the connections within a group as a whole. The group has to satisfy certain properties without zooming into the node level
Network-Centric Community
– Partition the whole network into several disjoint sets
Hierarchy-Centric Community
– Construct a hierarchical structure of communities

Node-Centric Community Detection
Nodes satisfy different properties
– Complete mutuality: cliques
– Reachability of members: k-clique, k-clan, k-club
– Nodal degrees: k-plex, k-core
– Relative frequency of within-outside ties: LS sets, Lambda sets
Commonly used in traditional social network analysis. Here, we discuss some representative ones.

Complete Mutuality: Cliques
Clique: a maximal complete subgraph, i.e. one in which all nodes are adjacent to each other
It is NP-hard to find the maximum clique in a network
A straightforward implementation to find cliques is very expensive in time complexity
In the example, nodes 5, 6, 7 and 8 form a clique. [Figure not preserved.]

Finding the Maximum Clique
In a clique of size k, each node maintains degree >= k-1
– Nodes with degree < k-1 will not be included in the maximum clique
Recursively apply the following pruning procedure
– Sample a sub-network from the given network, and find a clique in the sub-network, say, by a greedy approach
– Suppose the clique above is of size k. In order to find a larger clique, all nodes with degree <= k-1 should be removed
Repeat until the network is small enough
Many nodes will be pruned, as social media networks follow a power law distribution for node degrees

Maximum Clique Example
Suppose we sample a sub-network with nodes {1-9} and find a clique {1, 2, 3} of size 3
In order to find a clique of size > 3, remove all nodes with degree <= 3-1 = 2
– Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
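The pruning step can be sketched as a loop that keeps deleting low-degree nodes until none is left. The edge list is an assumption: the slide’s figure is lost, so the nine-node graph below was reconstructed to be consistent with the removal order listed above and with the statement that nodes 5, 6, 7 and 8 form a clique. The function names are illustrative.

```python
# Assumed edge list for the nine-node example network.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def prune(adj, k):
    """Repeatedly drop nodes with degree <= k-1; none of them can belong to a
    clique larger than k. Returns the surviving nodes and the removal rounds."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    rounds = []
    while True:
        low = sorted(u for u, vs in adj.items() if len(vs) <= k - 1)
        if not low:
            return set(adj), rounds
        rounds.append(low)
        for u in low:
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)  # removing u lowers its neighbours' degrees

REMAINING, ROUNDS = prune(build(EDGES), k=3)
```

On this assumed graph the rounds reproduce the three removal steps of the example, leaving only the candidates for a larger clique.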

Clique Percolation Method (CPM)
A clique is a very strict definition, and unstable
Normally cliques are used as a core or a seed to find larger communities
CPM is such a method to find overlapping communities
– Input: a parameter k, and a network
– Procedure: find all cliques of size k in the given network; construct a clique graph, in which two cliques are adjacent if they share k-1 nodes; each connected component in the clique graph forms a community

CPM Example
Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}
Communities: {1, 2, 3, 4}, {4, 5, 6, 7, 8}

Reachability: k-clique, k-club
Any node in a group should be reachable in k hops
k-clique: a maximal subgraph in which the largest geodesic distance between any two nodes is <= k
k-club: a substructure of diameter <= k
A k-clique might have a diameter larger than k in the subgraph
– E.g. {1, 2, 3, 4, 5}
Commonly used in traditional SNA; often involves combinatorial optimization
Example [figure not preserved]: Cliques: {1, 2, 3}; 2-cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}; 2-clubs: {1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}

Group-Centric Community Detection: Density-Based Groups
The group-centric criterion requires the whole group to satisfy a certain condition
– E.g., the group density >= a given threshold
A subgraph $G_s(V_s, E_s)$ is a $\gamma$-dense quasi-clique if $\frac{2|E_s|}{|V_s|(|V_s|-1)} \geq \gamma$, where the denominator is the maximum possible sum of degrees
A similar strategy to that of cliques can be used
– Sample a subgraph, and find a maximal quasi-clique (say, of size $|V_s|$)
– Remove nodes with degree less than the average degree

Network-Centric Community Detection
The network-centric criterion needs to consider the connections within a network globally
Goal: partition the nodes of a network into disjoint sets
Approaches:
– (1) Clustering based on vertex similarity
– (2) Latent space models (multi-dimensional scaling)
– (3) Block model approximation
– (4) Spectral clustering
– (5) Modularity maximization

(1) Clustering based on Vertex Similarity
Apply k-means or similarity-based clustering to nodes
Vertex similarity is defined in terms of the similarity of the nodes’ neighborhoods
Structural equivalence: two nodes are structurally equivalent iff they are connected to the same set of actors
Structural equivalence is too restrictive for practical use
In the example, nodes 1 and 3 are structurally equivalent; so are node 5 and a second node whose label is lost in the transcript. [Figure not preserved.]

(1) Clustering based on Vertex Similarity: Vertex Similarity
Jaccard similarity: $Jaccard(v_i, v_j) = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}$
Cosine similarity: $Cosine(v_i, v_j) = \frac{|N_i \cap N_j|}{\sqrt{|N_i| \cdot |N_j|}}$
where $N_i$ is the set of neighbors of node $v_i$
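The three CPM steps can be sketched with brute-force clique enumeration, which is only practical for small graphs. The edge list is an assumption (the figure is lost); it was chosen so that its size-3 cliques are exactly the seven listed above. The helper names are illustrative.

```python
from itertools import combinations

# Assumed edge list for the nine-node example network.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_cliques(adj, k):
    """All node sets of size k whose members are pairwise adjacent."""
    return [set(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def cpm(adj, k):
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))  # union-find over the clique graph

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:  # adjacent cliques share k-1 nodes
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())

COMMUNITIES = cpm(build(EDGES), k=3)
```

Node 4 ends up in both communities, which is how CPM produces overlapping groups; node 9 belongs to no triangle and is left unassigned.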
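Both similarities compare the neighbor sets of two nodes. The sketch below assumes the common set-based definitions (shared neighbors divided by the union size for Jaccard, and by the geometric mean of the two degrees for cosine) and applies them to an assumed nine-node edge list standing in for the lost figure.

```python
from math import sqrt

# Assumed edge list for the nine-node example network.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, i, j):
    return len(adj[i] & adj[j]) / len(adj[i] | adj[j])

def cosine(adj, i, j):
    return len(adj[i] & adj[j]) / sqrt(len(adj[i]) * len(adj[j]))

ADJ = build(EDGES)
J46 = jaccard(ADJ, 4, 6)  # neighbors {1, 3, 5, 6} and {4, 5, 7, 8} share only node 5
C46 = cosine(ADJ, 4, 6)
```

The resulting pairwise similarities can then be passed to k-means or any other similarity-based clustering routine.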

Vertex Similarity (2)

Clustering based on vertex similarity (K-means) Each cluster is associated with a centroid Each node is assigned to the cluster with the closest centroid

Cut
Most interactions are within groups, whereas interactions between groups are few
Community detection → minimum cut problem
Cut: a partition of the vertices of a graph into two disjoint sets
Minimum cut problem: find a graph partition such that the number of edges between the two sets is minimized

Ratio Cut & Normalized Cut
Minimum cut often returns an imbalanced partition, with one set being a singleton, e.g. node 9
Change the objective function to consider community size:
$Ratio\ Cut(\pi) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(C_i, \bar{C_i})}{|C_i|}$
$Normalized\ Cut(\pi) = \frac{1}{k}\sum_{i=1}^{k}\frac{cut(C_i, \bar{C_i})}{vol(C_i)}$
$C_i$: a community; $|C_i|$: number of nodes in $C_i$; $vol(C_i)$: sum of degrees in $C_i$

Ratio Cut & Normalized Cut Example
Scores are given for the partition drawn in red and for the partition drawn in green. [Figure and values not preserved.]
Both ratio cut and normalized cut prefer a balanced partition

Hierarchy-Centric Community Detection
Goal: build a hierarchical structure of communities based on network topology
Allows the analysis of a network at different resolutions
Representative approaches:
– Divisive hierarchical clustering (top-down)
– Agglomerative hierarchical clustering (bottom-up)

Divisive Hierarchical Clustering
Divisive clustering
– Partition nodes into several sets
– Each set is further divided into smaller ones
– Network-centric partition can be applied for the partition
One particular example: recursively remove the “weakest” tie
– Find the edge with the least strength
– Remove the edge and update the corresponding strength of each edge
Recursively apply the above two steps until the network is decomposed into the desired number of connected components. Each component forms a community

Edge Betweenness
The strength of a tie can be measured by edge betweenness
Edge betweenness: the number of shortest paths that pass along the edge
The edge with higher betweenness tends to be the bridge between two communities
The edge betweenness of e(1, 2) is 4 (= 6/2 + 1), as all the shortest paths from 2 to {4, 5, 6, 7, 8, 9} have to pass either e(1, 2) or e(2, 3), and e(1, 2) is the shortest path between 1 and 2

Divisive clustering based on edge betweenness
Idea: progressively remove edges with the highest betweenness
[Figure: initial betweenness values; not preserved.]
After removing e(4, 5), the betweenness of e(4, 6) becomes 20, which is the highest
After removing e(4, 6), the edge e(7, 9) has the highest betweenness value, 4, and should be removed

Agglomerative Hierarchical Clustering
Initialize each node as a community
Merge communities successively into larger communities following a certain criterion
– E.g., based on vertex similarity
[Figure: dendrogram according to agglomerative clustering based on modularity.]

Summary of Community Detection
Node-Centric Community Detection
– cliques, k-cliques, k-clubs
Group-Centric Community Detection
– quasi-cliques
Network-Centric Community Detection
– Clustering based on vertex similarity
Hierarchy-Centric Community Detection
– Divisive clustering
– Agglomerative clustering

COMMUNITY EVALUATION

Evaluating Community Detection (1)
For groups with clear definitions
– E.g., cliques, k-cliques, k-clubs, quasi-cliques
– Verify whether extracted communities satisfy the definition
For networks with ground truth information
– Normalized mutual information
– Accuracy of pairwise community memberships

Measuring a Clustering Result
[Figure: ground truth {1, 2, 3}, {4, 5, 6}; clustering result {1, 3}, {2}, {4, 5, 6}.]
How to measure the clustering quality?
– The number of communities after grouping can be different from the ground truth
– There is no clear community correspondence between the clustering result and the ground truth
– Normalized Mutual Information can be used

Accuracy of Pairwise Community Memberships
Consider all the possible pairs of nodes and check whether they reside in the same community
An error occurs if
– Two nodes belonging to the same community are assigned to different communities after clustering
– Two nodes belonging to different communities are assigned to the same community
Construct a contingency table or confusion matrix

Accuracy Example
Ground truth: {1, 2, 3}, {4, 5, 6}. Clustering result: {1, 3}, {2}, {4, 5, 6}
Contingency table over the 15 node pairs:
                                       Ground truth C(v_i) = C(v_j)   Ground truth C(v_i) ≠ C(v_j)
Clustering result C(v_i) = C(v_j)      4                              0
Clustering result C(v_i) ≠ C(v_j)      2                              9
Accuracy = (4+9) / (4+0+2+9) = 13/15

Normalized Mutual Information
Entropy: the information contained in a distribution, $H(X) = -\sum_{x} p(x) \log p(x)$
Mutual information: the shared information between two distributions, $I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$
Normalized mutual information (between 0 and 1): $NMI(X; Y) = \frac{I(X; Y)}{\sqrt{H(X)\,H(Y)}}$
Consider a partition as a distribution (the probability of one node falling into one community); we can then compute the matching between the clustering result and the ground truth:
$NMI(\pi^a, \pi^b) = \frac{\sum_{h=1}^{k_a}\sum_{l=1}^{k_b} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{a}\, n_l^{b}}\right)}{\sqrt{\left(\sum_{h=1}^{k_a} n_h^{a} \log\frac{n_h^{a}}{n}\right)\left(\sum_{l=1}^{k_b} n_l^{b} \log\frac{n_l^{b}}{n}\right)}}$
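The two objectives can be compared on a small graph. The sketch assumes the averaged forms: for each community, the number of edges leaving it divided by its size (ratio cut) or by its summed degree (normalized cut), averaged over the k communities. The edge list and the two partitions are assumptions standing in for the lost figure: one cuts off the singleton {9}, the other splits the graph into two larger groups.

```python
# Assumed edge list for the nine-node example network.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def cut_size(adj, community):
    """Number of edges with exactly one endpoint inside the community."""
    return sum(1 for u in community for v in adj[u] if v not in community)

def ratio_cut(adj, partition):
    return sum(cut_size(adj, c) / len(c) for c in partition) / len(partition)

def normalized_cut(adj, partition):
    def volume(c):
        return sum(len(adj[u]) for u in c)  # sum of degrees in the community
    return sum(cut_size(adj, c) / volume(c) for c in partition) / len(partition)

ADJ = build(EDGES)
SINGLETON = [{9}, {1, 2, 3, 4, 5, 6, 7, 8}]   # the minimum cut (one edge)
BALANCED = [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]     # cuts two edges
```

Although the singleton split removes fewer edges, both objectives score the balanced split lower, i.e. better.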
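The removal sequence can be reproduced with a small, unoptimized edge-betweenness routine. This is a sketch, not the authors’ implementation. The edge list is an assumption reconstructed to match the betweenness values quoted on these slides, each pair of nodes is counted once, and ties for the highest betweenness are broken by taking the first edge in sorted order.

```python
from collections import deque
from itertools import combinations

# Assumed edge list for the nine-node example network.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_counts(adj, s):
    """Distances from s and the number of geodesics from s to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def edge_betweenness(adj):
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {(u, v): 0.0 for u in adj for v in adj[u] if u < v}
    for s, t in combinations(sorted(adj), 2):
        (dist_s, sigma_s), (dist_t, sigma_t) = info[s], info[t]
        if t not in dist_s:
            continue  # s and t lie in different components
        for u, v in score:
            for a, b in ((u, v), (v, u)):  # the edge may be crossed in either direction
                if a in dist_s and b in dist_t and dist_s[a] + 1 + dist_t[b] == dist_s[t]:
                    score[(u, v)] += sigma_s[a] * sigma_t[b] / sigma_s[t]
    return score

def remove_bridges(edges, steps):
    """Remove the highest-betweenness edge `steps` times, recomputing each time."""
    adj, removed = build(edges), []
    for _ in range(steps):
        score = edge_betweenness(adj)
        u, v = max(sorted(score), key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
    return removed

INITIAL = edge_betweenness(build(EDGES))
REMOVED = remove_bridges(EDGES, 3)
```

Recomputing betweenness after every removal matters: e(4, 6) only becomes the single bridge, with betweenness 20, once e(4, 5) is gone.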
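The pair counting can be checked with a few lines of Python. This is a minimal sketch; the label names ("A", "x", ...) are arbitrary and only the grouping matters.

```python
from itertools import combinations

def pairwise_table(truth, result):
    """Count node pairs as (same in both, same only in result,
    same only in truth, different in both)."""
    both = only_result = only_truth = neither = 0
    for u, v in combinations(sorted(truth), 2):
        same_truth = truth[u] == truth[v]
        same_result = result[u] == result[v]
        if same_truth and same_result:
            both += 1
        elif same_result:
            only_result += 1
        elif same_truth:
            only_truth += 1
        else:
            neither += 1
    return both, only_result, only_truth, neither

def pairwise_accuracy(truth, result):
    both, only_result, only_truth, neither = pairwise_table(truth, result)
    return (both + neither) / (both + only_result + only_truth + neither)

# Ground truth {1, 2, 3}, {4, 5, 6}; clustering result {1, 3}, {2}, {4, 5, 6}.
TRUTH = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
RESULT = {1: "x", 2: "y", 3: "x", 4: "z", 5: "z", 6: "z"}
TABLE = pairwise_table(TRUTH, RESULT)
ACCURACY = pairwise_accuracy(TRUTH, RESULT)
```

The two errors are the pairs (1, 2) and (2, 3), which the clustering separates although they belong together.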

$k_a$, $k_b$ = the clusters generated by the partitions $\pi^a$, $\pi^b$; h and l are the indices of the clusters in the partitions

NMI Example
Partition a: [1, 1, 1, 2, 2, 2], i.e. {1, 2, 3}, {4, 5, 6}
Partition b: [1, 2, 1, 3, 3, 3], i.e. {1, 3}, {2}, {4, 5, 6}
Cluster sizes in a: h=1: 3; h=2: 3
Cluster sizes in b: l=1: 2; l=2: 1; l=3: 3
Contingency table (confusion matrix) $n_{h,l}$:
       l=1   l=2   l=3
h=1     2     1     0
h=2     0     0     3
[The resulting NMI value is not preserved in the transcript.]

Evaluation using Semantics
For networks with semantics
– Networks come with semantic or attribute information of nodes or connections
– Human subjects can verify whether the extracted communities are coherent
Evaluation is qualitative
It is also intuitive and helps understand a community
[Figures: an animal community; a health community.]

Evaluation without Ground Truth
For networks without ground truth or semantic information
This is the most common situation
An option is to resort to cross-validation
– Extract communities from a (training) network
– Evaluate the quality of the community structure on a network constructed from a different date or based on a related type of interaction
Quantitative evaluation functions
– Modularity (M. Newman. Modularity and community structure in networks. PNAS 06.)
– Link prediction (the predicted network is compared with the true network)
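The example can be evaluated with a short script. It assumes the normalization by the square root of the two entropies, NMI = I(a; b) / sqrt(H(a) H(b)), and uses natural logarithms; the choice of base cancels in the ratio. The printed value is computed from the two partitions above and is not copied from the slide.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    # p(h, l) / (p(h) p(l)) simplifies to n * n_hl / (n_h * n_l).
    return sum(c / n * log(c * n / (count_a[h] * count_b[l]))
               for (h, l), c in joint.items())

def nmi(a, b):
    return mutual_information(a, b) / sqrt(entropy(a) * entropy(b))

PARTITION_A = [1, 1, 1, 2, 2, 2]
PARTITION_B = [1, 2, 1, 3, 3, 3]
SCORE = nmi(PARTITION_A, PARTITION_B)
print(round(SCORE, 4))
```

Comparing a partition with itself gives 1, and the score does not depend on how the clusters are numbered.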
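Modularity is only named on this slide, so the following sketch relies on the commonly cited form of Newman’s definition rather than on anything shown in the deck: for each community, the fraction of edges that fall inside it minus the fraction expected if edges were placed at random with the same degrees. The edge list and the partition are assumptions used for illustration.

```python
# Assumed edge list for the nine-node example network.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def modularity(adj, partition):
    two_m = sum(len(vs) for vs in adj.values())  # twice the number of edges
    q = 0.0
    for community in partition:
        # Each internal edge is seen from both endpoints, hence the 2m denominator.
        internal = sum(1 for u in community for v in adj[u] if v in community)
        volume = sum(len(adj[u]) for u in community)
        q += internal / two_m - (volume / two_m) ** 2
    return q

ADJ = build(EDGES)
Q_SPLIT = modularity(ADJ, [{1, 2, 3, 4}, {5, 6, 7, 8, 9}])
Q_WHOLE = modularity(ADJ, [set(ADJ)])
```

Putting every node in one community always gives a modularity of 0, so a positive score indicates more within-group ties than chance would produce.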