Communities and Clustering in some Social Networks Guido Caldarelli SMC CNR-INFM Rome.

INTRODUCTION Summary
Guido Caldarelli, Communities and Clustering in Some Social Networks, NetSci 2007, New York, May 20th 2007
1 Introduction on basic notions of graphs and clustering
2 Introduction on clustering methods based on similarity/centrality
3 Introduction on clustering methods based on spectral analysis
4 The case study of the word association network
5 The case study of Wikipedia
6 Conclusions and advertisements

INTRODUCTION 1.0 Basic matrix notation

INTRODUCTION 1.1 Clusters and Communities
Generally a cluster corresponds to a community. Some communities are hard to detect with clustering analysis.

INTRODUCTION 1.2 Small graphs
In order to detect communities, clustering is a good clue.
Figure labels: Clustering Coefficient; Motifs.
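
As an illustration of the clustering coefficient named on this slide, here is a minimal Python sketch. It assumes the usual local definition (the fraction of pairs of a vertex's neighbours that are themselves linked). The function name `local_clustering` and the toy graph are hypothetical and are not taken from the talk.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are linked to each other."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here for vertices with fewer than 2 neighbours
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficients = {v: local_clustering(toy_graph, v) for v in toy_graph}
```

Vertices inside the triangle get a high coefficient, while the vertex that also reaches outside the triangle gets a lower one, which is the sense in which clustering is a clue for communities.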

INTRODUCTION 1.2 Hubs and Authorities
Sometimes vertices differ from each other according to their function.
HITS: hubs are those web pages that point to a large number of authorities (i.e. they have a large number of outgoing edges). Authorities are those web pages pointed to by a large number of hubs (i.e. they have a large number of incoming edges).
Kleinberg, J.M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604–632.

INTRODUCTION 1.3 Hubs and Authorities
If every page $i$ has an authority score $U_i$ and a hubness score $H_i$, we can divide the pages according to their value of $U$ or $H$. These values are obtained from the principal eigenvectors of the matrices $A^T A$ and $A A^T$ respectively.
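
A minimal sketch of how such scores can be computed, assuming the convention that `A[i, j] = 1` when page `i` links to page `j`. It uses plain power iteration, which converges to the principal eigenvectors of `A.T @ A` (authorities) and `A @ A.T` (hubs). The function name `hits_scores` and the three-link toy web are hypothetical.

```python
import numpy as np


def hits_scores(A, iterations=100):
    """Power iteration for authority scores U and hub scores H."""
    n = A.shape[0]
    U = np.ones(n)
    H = np.ones(n)
    for _ in range(iterations):
        U = A.T @ H  # authority: sum of the hub scores of the pages pointing to it
        H = A @ U    # hub: sum of the authority scores of the pages it points to
        U = U / np.linalg.norm(U)
        H = H / np.linalg.norm(H)
    return U, H


# Hypothetical toy web: page 0 links to pages 2 and 3, page 1 links to page 2.
A = np.zeros((4, 4))
A[0, 2] = A[0, 3] = A[1, 2] = 1.0
U, H = hits_scores(A)
```

In this toy example page 2 collects the largest authority score and page 0 the largest hub score.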

TOPOLOGICAL ANALYSIS 2.1 Agglomerative Methods
One way to cluster vertices is to find similarities between them. One "topological" way is given by considering their neighbours. One can then define a distance x given by [equation not preserved in the transcript].
Brun, et al. (2003). Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5, R6 1–13.
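
The distance formula on this slide is not available here, so the following sketch substitutes a Jaccard distance between neighbourhoods. This is one common neighbour-based choice and is not necessarily the formula shown in the talk or used by Brun et al. The function name `neighbour_distance` and the toy graph are hypothetical.

```python
def neighbour_distance(adj, i, j):
    """Jaccard distance between the neighbourhoods of i and j.

    Each vertex is counted in its own neighbourhood (an assumption), so two
    linked vertices that share all their other neighbours are at distance 0.
    """
    ni = adj[i] | {i}
    nj = adj[j] | {j}
    return 1.0 - len(ni & nj) / len(ni | nj)


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
d_same = neighbour_distance(two_triangles, 0, 1)
d_bridge = neighbour_distance(two_triangles, 2, 3)
d_far = neighbour_distance(two_triangles, 0, 5)
```

An agglomerative method would then merge the closest pairs first, so vertices 0 and 1 would be grouped long before vertices 0 and 5.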

TOPOLOGICAL ANALYSIS 2.2 Divisive Methods: betweenness
The betweenness is a measure of the centrality of a vertex or edge in a graph. The algorithm of Girvan and Newman recursively selects and removes the edge with the largest edge-betweenness in the graph.
Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. (USA), 99, 7821–
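
A compact sketch of one divisive step in this spirit, assuming an unweighted, undirected graph without self-loops, stored as a dict of neighbour sets. Edge betweenness is computed with a breadth-first shortest-path count in the style of Brandes, a standard technique chosen here for illustration. All function names and the toy graph are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness of an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions back from the farthest vertices.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every pair of endpoints was visited twice (once from each side).
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(toy)[frozenset((2, 3))]
parts = girvan_newman_split(toy)
```

Applying `girvan_newman_split` again to each resulting part would give the recursive procedure that produces a dendrogram.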

TOPOLOGICAL ANALYSIS 2.3 Examples
The procedure, applied to a more complicated network, produces a dendrogram of the community structure.
(a) Friendship network from Zachary's karate club study (26). Nodes associated with the club administrator's faction are drawn as circles; those associated with the instructor's faction are drawn as squares. (b) Hierarchical tree showing the complete community structure. (c) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network.

TOPOLOGICAL ANALYSIS 2.3 Examples
One typical example is that of the network shown below, the case study of the University of Tarragona (Spain). Different colors correspond to different departments.
Guimerà, R., Danon, L., Diaz-Guilera, A., Giralt, F., and Arenas, A. (2002). Self-similar community structure in organisations. Physical Review E, 68,

TOPOLOGICAL ANALYSIS 2.4 Random walks and communities
Random walks on graphs are at the basis of the PageRank algorithm (Google): the larger the probability of passing through a certain page, the greater its interest.
Random walks can also be used to detect clusters in graphs. The idea is that the more closed a subgraph is, the longer a random walker needs to escape from it. One of the heuristic algorithms based on random walks is the Markov Cluster (MCL) algorithm. The complete description and code are available at [URL not preserved in the transcript].
Start from the normal matrix; through matrix manipulation (powers), one obtains a matrix for n-step connections. Enhance intracluster passages by raising the elements to a certain power and then normalize.

SPECTRAL ANALYSIS 2.3 MCL Technical
Expansion corresponds to computing random walks of higher length, that is, random walks with many steps. It associates new probabilities with all pairs of nodes, where one node is the point of departure and the other is the destination. Since higher-length paths are more common within clusters than between different clusters, the probabilities associated with node pairs lying in the same cluster will, in general, be relatively large, as there are many ways of going from one to the other.
Inflation then has the effect of boosting the probabilities of intra-cluster walks and demoting inter-cluster walks. This is achieved without any a priori knowledge of cluster structure. It is simply the result of cluster structure being present.
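
The expansion and inflation steps can be sketched in a few lines of NumPy. This is a simplified illustration and not the reference MCL implementation. The column-stochastic convention, the added self-loops, the parameter values and the `1e-6` threshold used to read off clusters are all assumptions, and the function name `mcl` is hypothetical.

```python
import numpy as np


def mcl(A, expansion=2, inflation=2.0, iterations=50):
    """Simplified Markov Cluster iteration on an adjacency matrix A."""
    M = A + np.eye(len(A))   # self-loops (assumed here, customary in MCL)
    M = M / M.sum(axis=0)    # columns are transition probabilities
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, expansion)  # expansion: longer walks
        M = M ** inflation                        # inflation: element-wise power
        M = M / M.sum(axis=0)                     # renormalize the columns
    # Rows that keep non-zero mass act as attractors; the columns they
    # attract form one cluster.
    clusters = set()
    for row in M:
        members = frozenset(np.flatnonzero(row > 1e-6).tolist())
        if members:
            clusters.add(members)
    return clusters


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
clusters = mcl(A)
```

A larger `inflation` value generally yields smaller, more numerous clusters. No cluster count is supplied anywhere, which matches the remark that the result comes only from the cluster structure being present.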

SPECTRAL ANALYSIS 3.1 The functions of the adjacency matrix
Normal Matrix; Laplacian Matrix

SPECTRAL ANALYSIS 3.1 The functions of the adjacency matrix
If $\phi' = L\phi$ [remainder of the expression not preserved in the transcript]. The elements of the matrix N give the probability with which a field $\phi$ passes from a vertex $i$ to its neighbours.
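
The slide's own definitions are not available here. The sketch below assumes the common ones: `N = K^{-1} A` for the normal matrix and `L = K - A` for the Laplacian, where `K` is the diagonal matrix of degrees. The toy graph and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

degrees = A.sum(axis=1)
K = np.diag(degrees)

N = A / degrees[:, None]  # assumed normal matrix, N = K^{-1} A (rows sum to 1)
L = K - A                 # assumed Laplacian matrix, L = K - A (rows sum to 0)

# A walker (or a unit of the field) sitting on vertex 0 moves to each
# neighbour of 0 with equal probability in one step.
phi = np.array([1.0, 0.0, 0.0, 0.0])
phi_next = phi @ N
```

Under these assumptions, row `i` of `N` is exactly the set of probabilities with which something on vertex `i` passes to each of its neighbours.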

SPECTRAL ANALYSIS 3.2 The block properties in clustered graphs
In a very clustered graph, the adjacency matrix can be put in a block form.

SPECTRAL ANALYSIS 3.2 The block properties in clustered graphs
Given this probabilistic interpretation of the matrix N, we have a series of results. For example, one eigenvalue is equal to one, and the related eigenvector is constant.
Consider the case of disconnected subclusters: the matrix N is made of blocks, and a general eigenvector is given by the direct sum of the blocks' eigenvectors (the constant can be different in each block!).
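
A small numerical check of this statement, assuming the normal matrix is `N = K^{-1} A`. The graph with two disconnected triangles, the helper `normal_matrix` and the vector `step` are hypothetical illustrations.

```python
import numpy as np


def normal_matrix(A):
    """Assumed definition: each row of A divided by the vertex degree."""
    return A / A.sum(axis=1, keepdims=True)


# Two disconnected triangles give a block-diagonal adjacency matrix.
triangle = np.ones((3, 3)) - np.eye(3)
empty = np.zeros((3, 3))
A = np.block([[triangle, empty], [empty, triangle]])
N = normal_matrix(A)

# One eigenvalue equal to 1 per disconnected block.
eigenvalues = np.linalg.eigvals(N).real
unit = np.isclose(eigenvalues, 1.0)

# A vector that is constant on each block, with a different constant per
# block, is still an eigenvector with eigenvalue 1.
step = np.array([1.0, 1.0, 1.0, -2.0, -2.0, -2.0])
image = N @ step
```

When the blocks are weakly connected instead of disconnected, one would expect the extra eigenvalues to move slightly below 1 and the step-like shape to survive approximately, which is what the community detection relies on.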

SPECTRAL ANALYSIS 3.3 Eigenvalues and Communities
It is possible to express the eigenvector problem as the search for a minimum under a constraint, where the $x_i$ are values assigned to nodes, with some constraint expressed by [equation not preserved in the transcript].
1. Define a fictitious quantity x for the sites of the graph.
2. Define a suitable function z on these x's (a "distance").
3. Define a suitable constraint on these x's (to avoid having all equal or all 0).
Stationary points of z(x) + constraint (A) → Lagrange multiplier (A).
For example: [equation not preserved in the transcript]

SPECTRAL ANALYSIS 3.3 Eigenvalues and Communities
Lagrange Multiplier = Normal Eigenvalue problem
Lagrange Multiplier = Laplacian Eigenvalue problem
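
The equations of these two slides are not available here. The following worked derivation is a sketch under assumed forms: a quadratic "distance" $z(x)$ over linked vertices and a quadratic constraint with a symmetric matrix $M$. The symbols $M$, $\mu$ and $K$ (the diagonal matrix of degrees), and the definitions $N = K^{-1}A$ and $L = K - A$, are assumptions and may differ from the notation used in the talk.

```latex
% Assumed objective and constraint (A and M symmetric):
\[
  z(x) = \frac{1}{2} \sum_{i,j} (x_i - x_j)^2 A_{ij},
  \qquad
  \sum_{i,j} x_i x_j M_{ij} = 1 .
\]
% Stationary points with a Lagrange multiplier \mu, where k_i = \sum_j A_{ij}:
\[
  \frac{\partial}{\partial x_i}
  \Big[ z(x) - \mu \sum_{j,l} x_j x_l M_{jl} \Big]
  = 2 \Big( k_i x_i - \sum_j A_{ij} x_j \Big) - 2 \mu \sum_j M_{ij} x_j = 0
  \;\Longrightarrow\;
  (K - A)\, x = \mu\, M x .
\]
% Choice M = K: normal eigenvalue problem, with N = K^{-1} A.
\[
  M = K: \quad K^{-1} A\, x = (1 - \mu)\, x
  \quad\Longleftrightarrow\quad N x = (1 - \mu)\, x .
\]
% Choice M = I: Laplacian eigenvalue problem, with L = K - A.
\[
  M = I: \quad (K - A)\, x = \mu\, x
  \quad\Longleftrightarrow\quad L x = \mu\, x .
\]
```

Under these assumptions, the two choices of constraint reproduce the two cases named on the slide: the Lagrange multiplier plays the role of the eigenvalue of the normal matrix in one case and of the Laplacian in the other.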

WORD ASSOCIATION NETWORK 4.1 The experimental data
The data are collected through a psychological experiment: persons (about 100) are given a single word as a stimulus, e.g. "House". They must answer with the first word that comes to mind, e.g. "Family". Answers are later given as new stimuli, so that a network of average associations forms.
Figure: a path from "Volcano" to "Ache".
Steyvers, M. and Tenenbaum, J.B. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.

WORD ASSOCIATION NETWORK 4.1 The experimental data
The number of connections (i.e. the degree of nodes) is power-law distributed.
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. (2005). Detecting communities in large networks. Physica A, 352, 669–676.

WORD ASSOCIATION NETWORK 4.2 The community structure
Therefore we expect similar words to be on the same plateau. We can measure the correlation between the values of various vertices averaged over 10 different eigenvectors.

science      1        literature   1        piano      1
scientific   0.994    dictionary   0.994    cello      0.993
chemistry    0.990    editorial    0.990    fiddle     0.992
physics      0.988    synopsis     0.988    viola      0.990
concentrate  0.973    words        0.987    banjo      0.988
thinking     0.973    grammar      0.986    saxophone  0.985
test         0.973    adjective    0.983    director   0.984
lab          0.969    chapter      0.982    violin     0.983
brain        0.965    prose        0.979    clarinet   0.983
equation     0.963    topic        0.976    oboe       0.983
examine      0.962    English      0.975    theater    0.982
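
A sketch of how such a similarity between vertices could be computed from a few eigenvectors, assuming the normal matrix is `N = K^{-1} A`. The slide does not give the formula, so this example swaps in an uncentered (cosine-style) correlation, chosen because it does not depend on the arbitrary sign of each eigenvector. The function name `spectral_similarity` and the toy graph are hypothetical.

```python
import numpy as np


def spectral_similarity(A, n_vectors):
    """Cosine-style similarity of vertices over leading eigenvectors of N."""
    degrees = A.sum(axis=1)
    # K^{-1/2} A K^{-1/2} is symmetric and has the same eigenvalues as N.
    S = A / np.sqrt(np.outer(degrees, degrees))
    _, vectors = np.linalg.eigh(S)           # eigenvalues in ascending order
    X = vectors / np.sqrt(degrees)[:, None]  # eigenvectors of N = K^{-1} A
    X = X[:, -(n_vectors + 1):-1]            # drop the constant one (eigenvalue 1)
    norms = np.linalg.norm(X, axis=1)
    return (X @ X.T) / np.outer(norms, norms)


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
similarity = spectral_similarity(A, 1)
```

With a single eigenvector the similarity only records which side of the step a vertex sits on. Using more eigenvectors, as in the table above, gives graded values between vertices.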

WIKIPEDIA 5.1 Introduction
A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopedia Britannica. Among the 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; the one in Britannica, about three. On the other hand, the articles on Wikipedia are longer on average than those of Britannica. This accounts for a lower rate of errors in Wikipedia.

WIKIPEDIA 5.2 The network properties
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13. We are not using the current data due to disk space restrictions. The English dataset of June 2005 is more than 36 GB compressed, that is, about 200 GB expanded.

WIKIPEDIA 5.2 The network properties
Figure: in-degree (empty) and out-degree (filled) occurrence distributions for the Wikigraph in English (o) and Portuguese ().
The degree shows fat tails that can be approximated by a power-law function of the kind $P(k) \sim k^{-\gamma}$, where the exponent is the same for both in-degree and out-degree. In the case of the WWW, $2 \le \gamma_{in} \le$ [upper bound not preserved in the transcript].
Capocci, A., et al. (2006). Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74,

WIKIPEDIA 5.2 The network properties
Figure: the average neighbours' in-degree, computed along incoming edges, as a function of the in-degree for the English (o) and Portuguese () wikigraphs.
As regards the assortativity (as measured by the average degree of the neighbours of a vertex with degree k), there is no evidence of any assortative behaviour.
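
A minimal sketch of this measurement on a directed edge list. It assumes that "computed along incoming edges" means: for each page, average the in-degree of the pages that link to it, then average the result over pages with the same in-degree k. The function name `average_neighbour_in_degree` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def average_neighbour_in_degree(edges):
    """Map in-degree k to the mean in-degree of the pages linking to pages of in-degree k."""
    in_degree = defaultdict(int)
    sources = defaultdict(list)
    for source, target in edges:
        in_degree[target] += 1
        sources[target].append(source)
    per_k = defaultdict(list)
    for page, linking_pages in sources.items():
        mean_neighbour = sum(in_degree[p] for p in linking_pages) / len(linking_pages)
        per_k[in_degree[page]].append(mean_neighbour)
    return {k: sum(values) / len(values) for k, values in sorted(per_k.items())}


# Hypothetical toy wikigraph given as (source page, target page) links.
toy_links = [(0, 2), (1, 2), (2, 3), (3, 0)]
knn = average_neighbour_in_degree(toy_links)
```

A curve that rises with k would indicate assortative behaviour and a falling one disassortative behaviour. A flat curve corresponds to the absence of evidence reported on the slide.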

WIKIPEDIA 5.3 The growth of Wikipedia
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices (whose degree is k) acquiring new connections at time t. This quantity is weighted by the factor N(t)/n(k,t). We find preferential attachment for both in-degree and out-degree.
Figure: English (o) and Portuguese (). White = in-degree, filled = out-degree.

WIKIPEDIA 5.4 The communities in Wikipedia
Taxonomy: the categorization provided gives an imposed taxonomy to the pages.

WIKIPEDIA 5.4 The communities in Wikipedia
Given different wikigraphs, one can compute the frequency of the category sizes in the various systems.

WIKIPEDIA 5.4 The communities in Wikipedia
Similarly, the cluster size frequency distribution (computed with the MCL algorithm) can also be considered. Qualitatively, the agreement is rather good. But are they the same?

WIKIPEDIA The communities in Wikipedia
NOT REALLY! The power-law shape is probably a very common feature of any categorization.

SUMMARY
Communities represent an important categorization of graphs. Methods to detect them vary according to the specific case study.
SMALL GRAPHS (motifs, clustering coefficient)
LARGE GRAPHS
  FUNCTION OF VERTICES (HITS, vertex similarity)
  CENTRALITY (Girvan-Newman algorithm)
  DIFFUSION ON THE GRAPH (MCL algorithm; spectral analysis of the stochastic matrices associated with the graph)

SHAMELESS ADVERTISEMENT
