Communities and Clustering in some Social Networks Guido Caldarelli SMC CNR-INFM Rome
INTRODUCTION Summary
Guido Caldarelli, Communities and Clustering in Some Social Networks, NetSci 2007, New York, May 20th
1 Introduction to basic notions of graphs and clustering
2 Introduction to clustering methods based on similarity/centrality
3 Introduction to clustering methods based on spectral analysis
4 The case study of the word association network
5 The case study of Wikipedia
6 Conclusions and advertisements
INTRODUCTION 1.0 Basic matrix notation
INTRODUCTION 1.1 Clusters and Communities
Generally a cluster corresponds to a community. Some communities are hard to detect with clustering analysis.
INTRODUCTION 1.2 Small graphs
In order to detect communities, clustering is a good clue.
Clustering Coefficient
Motifs
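As a minimal sketch of the clustering coefficient named on this slide: the function below counts, for one vertex, the fraction of pairs of its neighbours that are themselves linked. The graph representation (a dict of neighbour sets), the function name and the toy graph are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves linked."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention used here: fewer than two neighbours gives 0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for v in sorted(toy):
    print(v, local_clustering(toy, v))
```

Vertices inside the triangle score high, while the vertex that also reaches outside the triangle scores lower, which is the sense in which clustering is a clue to community membership.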
INTRODUCTION 1.2 Hubs and Authorities
Sometimes vertices differ from each other according to their function.
HITS
Hubs are those web pages that point to a large number of authorities (i.e. they have a large number of outgoing edges).
Authorities are those web pages pointed to by a large number of hubs (i.e. they have a large number of incoming edges).
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604–632.
INTRODUCTION 1.3 Hubs and Authorities
If every page i has an authority U_i and a hubness H_i, we can divide the pages according to their value of U or H. These values are obtained from the principal eigenvectors of the matrices A^T A and A A^T, respectively.
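A minimal sketch of how these scores can be computed in practice, assuming plain power iteration: repeatedly applying A^T and A and normalising converges towards the dominant eigenvectors of A^T A (authorities) and A A^T (hubs). The function name, the edge-list input and the toy graph are assumptions made for illustration.

```python
def hits(edges, n, iterations=100):
    """Power iteration for hub and authority scores on a directed graph.

    edges: list of (source, target) pairs; n: number of pages.
    """
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # authority update, a = A^T h: sum the hub scores of pages pointing in
        auth = [0.0] * n
        for s, t in edges:
            auth[t] += hub[s]
        # hub update, h = A a: sum the authority scores of pages pointed to
        new_hub = [0.0] * n
        for s, t in edges:
            new_hub[s] += auth[t]
        hub = new_hub
        # normalise so that the scores do not grow without bound
        a_norm = sum(x * x for x in auth) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub) ** 0.5 or 1.0
        auth = [x / a_norm for x in auth]
        hub = [x / h_norm for x in hub]
    return hub, auth

# Hypothetical toy web: pages 0 and 1 link to pages 2 and 3, page 4 links to 3 only.
toy_edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 3)]
hub, auth = hits(toy_edges, 5)
print("hubs:", hub)
print("authorities:", auth)
```

Sorting pages by either score then gives the division into hubs and authorities described above.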
TOPOLOGICAL ANALYSIS 2.1 Agglomerative Methods
One way to cluster vertices is to find similarities between them. One “topological” way is to consider their neighbours. One can then define a distance x between two vertices in terms of their neighbours.
Brun et al. (2003). Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5, R6 1–13.
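As an illustration only: one common neighbour-based choice is the Jaccard distance between neighbour sets. This is an assumption made for the sketch and not necessarily the distance x used on the slide; the function name and toy graph are hypothetical as well.

```python
def neighbour_distance(adj, i, j):
    """Jaccard-style distance between the neighbourhoods of i and j.

    0 means identical neighbour sets, 1 means no neighbour in common.
    NOTE: an assumed stand-in, not necessarily the slide's definition.
    """
    ni, nj = adj[i], adj[j]
    union = ni | nj
    if not union:
        return 0.0  # two isolated vertices are treated as identical here
    return 1.0 - len(ni & nj) / len(union)

# Hypothetical toy graph given as a dict of neighbour sets.
toy_graph = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1, 4}, 4: {3}}
print(neighbour_distance(toy_graph, 0, 1))
print(neighbour_distance(toy_graph, 0, 4))
```

An agglomerative method would then repeatedly merge the pair of vertices (or groups) at the smallest distance.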
TOPOLOGICAL ANALYSIS 2.2 Divisive Methods: betweenness
The betweenness is a measure of the centrality of a vertex/edge in a graph. The algorithm of Girvan and Newman recursively selects the edge with the largest betweenness in the graph and removes it.
Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. of Science (USA), 99, 7821–
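A minimal sketch of one divisive step, assuming an undirected, unweighted graph without self-loops stored as a dict of neighbour sets. Edge betweenness is computed here with a breadth-first accumulation in the style of Brandes, which is one standard way to do it and not necessarily the implementation used in the cited work; all names and the toy graph are illustrative.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge of an undirected, unweighted graph."""
    score = {}
    for u in adj:
        for v in adj[u]:
            score[frozenset((u, v))] = 0.0
    for source in adj:
        # breadth-first search that counts shortest paths from the source
        dist = {source: 0}
        paths = {source: 1.0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    paths[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # back-propagate path shares, starting from the farthest vertices
        credit = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # every pair of endpoints was counted once from each end
    return {edge: value / 2.0 for edge, value in score.items()}

def remove_most_central_edge(adj):
    """One divisive step: delete the edge with the largest betweenness."""
    scores = edge_betweenness(adj)
    u, v = max(scores, key=scores.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return u, v

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
removed = remove_most_central_edge(barbell)
print("removed edge:", removed)
```

Repeating the step and recording when the graph splits into components yields the hierarchy of communities; betweenness has to be recomputed after each removal.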
TOPOLOGICAL ANALYSIS 2.3 Examples
The procedure, applied to a more complicated network, produces a dendrogram of the community structure.
(a) Friendship network from Zachary’s karate club study (26). Nodes associated with the club administrator’s faction are drawn as circles, those associated with the instructor’s faction are drawn as squares. (b) Hierarchical tree showing the complete community structure. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.
TOPOLOGICAL ANALYSIS 2.3 Examples
One typical example is that of the e-mail network. Below is the case study of the University of Tarragona (Spain). Different colors correspond to different departments.
Guimerà, R., Danon, L., Diaz-Guilera, A., Giralt, F., and Arenas, A. (2002). Self-similar community structure in organisations. Physical Review E, 68,
TOPOLOGICAL ANALYSIS 2.4 Random walks and communities
Random walks on graphs are at the basis of the PageRank algorithm (Google): the larger the probability of passing through a certain page, the greater its interest.
Random walks can also be used to detect clusters in graphs. The idea is that the more closed a subgraph is, the longer a random walker needs to escape from it. One of the heuristic algorithms based on random walks is the Markov Cluster (MCL) algorithm. The complete description and code are available online.
Start from the normal matrix; through matrix manipulation (power) one obtains a matrix for n-step connections. Enhance intra-cluster passages by raising the elements to a certain power and then normalize.
SPECTRAL ANALYSIS 2.3 MCL Technical
Expansion corresponds to computing random walks of higher length, that is, random walks with many steps. It associates new probabilities with all pairs of nodes, where one node is the point of departure and the other is the destination. Since higher-length paths are more common within clusters than between different clusters, the probabilities associated with node pairs lying in the same cluster will, in general, be relatively large, as there are many ways of going from one to the other.
Inflation will then have the effect of boosting the probabilities of intra-cluster walks and demoting inter-cluster walks. This is achieved without any a priori knowledge of cluster structure. It is simply the result of cluster structure being present.
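The expansion and inflation steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference MCL implementation: the matrix is column-normalised, expansion is a plain matrix square, self-loops are added before normalising, and the inflation power r, the iteration count, the threshold and all function names are choices made for the example.

```python
def adjacency_from_edges(n, edges):
    """Dense symmetric adjacency matrix from a list of undirected edges."""
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    return a

def normalise_columns(m):
    """Rescale every column so that it sums to one (a probability vector)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise each entry to the power r, then renormalise columns."""
    return normalise_columns([[x ** r for x in row] for row in m])

def mcl(adjacency, r=2.0, iterations=50):
    n = len(adjacency)
    # adding self-loops before normalising is an assumption of this sketch
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), r)
    # rows that keep non-zero mass are read as clusters via their support
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
barbell_matrix = adjacency_from_edges(
    6, [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(sorted(sorted(c) for c in mcl(barbell_matrix)))
```

A larger r makes inflation more aggressive and tends to give smaller clusters; no cluster count is supplied anywhere, in line with the remark that no a priori knowledge of the structure is needed.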
SPECTRAL ANALYSIS 3.1 The functions of the adjacency matrix
Normal Matrix
Laplacian Matrix
SPECTRAL ANALYSIS 3.1 The functions of the adjacency matrix
If ’ = L The elements of matrix N give the probability with which a field passes from a vertex i to its neighbours.
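A small sketch of the two matrices, assuming the usual definitions N = K^-1 A and L = K - A, where K is the diagonal matrix of vertex degrees. These definitions are stated here as an assumption (sign and normalisation conventions vary), and the function names are illustrative.

```python
def degree(adjacency):
    """Degree of each vertex: the row sums of the adjacency matrix."""
    return [sum(row) for row in adjacency]

def normal_matrix(adjacency):
    """N = K^-1 A: entry (i, j) is the probability of stepping from i to j."""
    k = degree(adjacency)
    return [[a / k[i] if k[i] else 0.0 for a in row]
            for i, row in enumerate(adjacency)]

def laplacian_matrix(adjacency):
    """L = K - A, with K the diagonal matrix of degrees."""
    k = degree(adjacency)
    n = len(adjacency)
    return [[(k[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical toy graph: the path 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(normal_matrix(path))
print(laplacian_matrix(path))
```

Each row of N sums to one, which is what allows its entries to be read as transition probabilities; each row of L sums to zero.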
SPECTRAL ANALYSIS 3.2 The block properties in clustered graphs
In a very clustered graph, the adjacency matrix can be put in a block form.
SPECTRAL ANALYSIS 3.2 The block properties in clustered graphs
Given this probabilistic explanation for the matrix N, we have a series of results. For example, one eigenvalue is equal to one and the related eigenvector is constant. Consider the case of disconnected subclusters: the matrix N is made of blocks, and a general eigenvector is given by the direct sum of the blocks’ eigenvectors (the constant can be different in each block!).
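The statement about disconnected subclusters can be checked numerically. The sketch below assumes N = K^-1 A (rows normalised by degree) and uses a made-up graph with two disconnected pieces; a vector that is constant on each piece, with a different constant per piece, is left unchanged by N, i.e. it is an eigenvector with eigenvalue one.

```python
def normal_matrix(adjacency):
    """N = K^-1 A: divide every row of A by the degree of its vertex."""
    out = []
    for row in adjacency:
        k = sum(row)
        out.append([a / k if k else 0.0 for a in row])
    return out

def apply(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# Hypothetical example: a triangle (0, 1, 2) and a separate path (3 - 4 - 5).
blocks = [[0, 1, 1, 0, 0, 0],
          [1, 0, 1, 0, 0, 0],
          [1, 1, 0, 0, 0, 0],
          [0, 0, 0, 0, 1, 0],
          [0, 0, 0, 1, 0, 1],
          [0, 0, 0, 0, 1, 0]]
step_vector = [2.0, 2.0, 2.0, -1.0, -1.0, -1.0]  # one constant per block
image = apply(normal_matrix(blocks), step_vector)
print(image)
```

When the blocks are only weakly connected instead of disconnected, such eigenvectors are expected to survive approximately as step-like profiles, which is what makes them useful for spotting communities.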
SPECTRAL ANALYSIS 3.3 Eigenvalues and Communities
It is possible to express the eigenvector problem as a search for a minimum of a function z(x) under constraint, where the x_i are values assigned to nodes.
1. Define a fictitious quantity x for the sites of the graph.
2. Define a suitable function z on these x’s (a “distance”).
3. Define a suitable constraint on these x’s (to avoid having all equal or all 0).
Stationary points of z(x) + constraint (A) → Lagrange multiplier (A)
SPECTRAL ANALYSIS 3.3 Eigenvalues and Communities
Lagrange multiplier = Normal eigenvalue problem
Lagrange multiplier = Laplacian eigenvalue problem
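A worked version of this argument, offered as a hedged reconstruction: the specific choices of z(x) and of the constraint matrix M below are assumptions in the spirit of the cited spectral approach, with a_{ij} the entries of a symmetric adjacency matrix A, K the diagonal matrix of degrees k_i, N = K^{-1}A and L = K - A.

```latex
% Assumed "distance" function and constraint (M symmetric):
z(x) = \frac{1}{2}\sum_{i,j} (x_i - x_j)^2\, a_{ij},
\qquad
\sum_{i,j} x_i\, m_{ij}\, x_j = 1 .

% Stationarity with Lagrange multiplier \mu:
\frac{\partial}{\partial x_l}\Big[ z(x) - \mu \Big( \sum_{i,j} x_i m_{ij} x_j - 1 \Big) \Big] = 0
\;\Longrightarrow\;
2\Big( k_l x_l - \sum_j a_{lj} x_j \Big) = 2\mu \sum_j m_{lj} x_j
\;\Longrightarrow\;
(K - A)\,x = \mu\, M x .

% Choice M = K: normal eigenvalue problem
(K - A)\,x = \mu K x
\;\Longrightarrow\;
K^{-1} A\, x = (1 - \mu)\, x
\;\Longrightarrow\;
N x = (1 - \mu)\, x .

% Choice M = I: Laplacian eigenvalue problem
(K - A)\,x = \mu\, x
\;\Longrightarrow\;
L x = \mu\, x .
```

In both cases the Lagrange multiplier plays the role of the eigenvalue, and small values of z correspond to assignments x that vary little across linked vertices.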
WORD ASSOCIATION NETWORK 4.1 The experimental data
The data are collected through a psychological experiment: about 100 people are given a single word as a stimulus, e.g. “House”. They must answer with the first word that comes to mind, e.g. “Family”. Answers are later given as new stimuli, so that a network of average associations forms.
Steyvers, M. and Tenenbaum, J.B. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.
Figure: a path from “Volcano” to “Ache”.
WORD ASSOCIATION NETWORK 4.1 The experimental data
The number of connections (i.e. the degree of nodes) is power-law distributed.
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. (2005). Detecting communities in large networks. Physica A, 352, 669–676.
WORD ASSOCIATION NETWORK 4.2 The community structure
Therefore we expect similar words to be on the same plateau. We can measure the correlation between the values of various vertices averaged over 10 different eigenvectors.
science      1      literature  1      piano      1
scientific   0.994  dictionary  0.994  cello      0.993
chemistry    0.990  editorial   0.990  fiddle     0.992
physics      0.988  synopsis    0.988  viola      0.990
concentrate  0.973  words       0.987  banjo      0.988
thinking     0.973  grammar     0.986  saxophone  0.985
test         0.973  adjective   0.983  director   0.984
lab          0.969  chapter     0.982  violin     0.983
brain        0.965  prose       0.979  clarinet   0.983
equation     0.963  topic       0.976  oboe       0.983
examine      0.962  English     0.975  theater    0.982
WIKIPEDIA 5.1 Introduction
A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopedia Britannica. Among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; the one in Britannica, about three. On the other hand, the articles on Wikipedia are longer on average than those of Britannica, which implies a lower rate of errors per unit of text in Wikipedia.
WIKIPEDIA 5.2 The network properties
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, built from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump (June 13). We are not using the current data due to disk space restrictions: the English dataset of June 2005 is more than 36 GB compressed, that is, about 200 GB uncompressed.
WIKIPEDIA 5.2 The network properties
Figure: in-degree (empty symbols) and out-degree (filled symbols) occurrence distributions for the Wikigraph in English (o) and Portuguese ().
The degree distributions show fat tails that can be approximated by a power-law function of the kind P(k) ~ k^(-γ), where the exponent is the same for both in-degree and out-degree. In the case of the WWW, 2 ≤ γ_in ≤
Capocci, A., et al. (2006). Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74,
WIKIPEDIA 5.2 The network properties
Figure: the average neighbours’ in-degree, computed along incoming edges, as a function of the in-degree for the English (o) and Portuguese () Wikigraphs.
As regards assortativity (as measured by the average degree of the neighbours of a vertex with degree k), there is no evidence of any assortative behaviour.
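A minimal sketch of how such a curve could be computed from a list of directed links, assuming the measure is taken as follows: for every page, average the in-degree of the pages linking to it, then average that value over all pages sharing the same in-degree k. The function name and the toy edge list are illustrative, not the authors' code or data.

```python
from collections import defaultdict

def average_neighbour_in_degree(edges):
    """Map each in-degree k to the mean in-degree of the in-neighbours
    of the vertices whose in-degree is k."""
    in_deg = defaultdict(int)
    sources = defaultdict(list)
    for s, t in edges:
        in_deg[t] += 1
        sources[t].append(s)
    per_k = defaultdict(list)
    for v, preds in sources.items():
        # mean in-degree of the pages that point to v
        mean = sum(in_deg[s] for s in preds) / len(preds)
        per_k[in_deg[v]].append(mean)
    return {k: sum(vals) / len(vals) for k, vals in per_k.items()}

# Hypothetical toy set of directed links (source, target).
toy_links = [(0, 1), (2, 1), (1, 3), (0, 3), (3, 0)]
knn = average_neighbour_in_degree(toy_links)
print(knn)
```

A curve that rises with k would indicate assortative mixing and a falling one disassortative mixing; a flat curve corresponds to the absence of assortative behaviour reported here.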
WIKIPEDIA 5.3 The growth of Wikipedia
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices (whose degree is k) acquiring new connections at time t. This quantity is weighted by the factor N(t)/n(k,t). We find preferential attachment for both in-degree and out-degree.
Figure: English (o) and Portuguese (). White = in-degree, filled = out-degree.
WIKIPEDIA 5.4 The communities in Wikipedia
Taxonomy: the categorization provided imposes a taxonomy on the pages.
WIKIPEDIA 5.3 The Communities in Wikipedia
Given different wikigraphs, one can compute the frequency of the category sizes in the various systems.
WIKIPEDIA 5.3 The Communities in Wikipedia
Similarly, the cluster size frequency distribution (computed with the MCL algorithm) can also be considered. Qualitatively, the agreement is rather good. But are they the same?
WIKIPEDIA The Communities in Wikipedia
NOT REALLY! The power-law shape is probably a very common feature of any categorization.
SUMMARY
Communities represent an important categorization of graphs. Methods to detect them vary according to the specific case study.
SMALL GRAPHS (motifs, clustering coefficient)
LARGE GRAPHS
  FUNCTION OF VERTICES (HITS, vertex similarity)
  CENTRALITY (Girvan-Newman algorithm)
  DIFFUSION ON THE GRAPH
    MCL algorithm
    Spectral analysis of the stochastic matrices associated with the graph
SHAMELESS ADVERTISEMENT