EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs Zhe Jin.

Slides:



Advertisements
Similar presentations
Partitional Algorithms to Detect Complex Clusters
Advertisements

Object Specific Compressed Sensing by minimizing a weighted L2-norm A. Mahalanobis.
Community Detection with Edge Content in Social Media Networks Paper presented by Konstantinos Giannakopoulos.
Leting Wu Xiaowei Ying, Xintao Wu Aidong Lu and Zhi-Hua Zhou PAKDD 2011 Spectral Analysis of k-balanced Signed Graphs 1.
Analysis and Modeling of Social Networks Foudalis Ilias.
Presenter: Raghu Ranganathan ECE / CMR Tennessee Technological University March 22th, 2011 Smart grid seminar series Yao Liu, Peng Ning, and Michael K.
Modularity and community structure in networks
Dimensionality reduction. Outline From distances to points : – MultiDimensional Scaling (MDS) Dimensionality Reductions or data projections Random projections.
1 Connectivity Structure of Bipartite Graphs via the KNC-Plot Erik Vee joint work with Ravi Kumar, Andrew Tomkins.
Kronecker Graphs: An Approach to Modeling Networks Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani Presented.
Information Networks Graph Clustering Lecture 14.
Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.
V4 Matrix algorithms and graph partitioning
Author: Jie chen and Yousef Saad IEEE transactions of knowledge and data engineering.
10/11/2001Random walks and spectral segmentation1 CSE 291 Fall 2001 Marina Meila and Jianbo Shi: Learning Segmentation by Random Walks/A Random Walks View.
Bayesian Robust Principal Component Analysis Presenter: Raghu Ranganathan ECE / CMR Tennessee Technological University January 21, 2011 Reading Group (Xinghao.
Communities in Heterogeneous Networks Chapter 4 1 Chapter 4, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool,
Dimensionality reduction. Outline From distances to points : – MultiDimensional Scaling (MDS) – FastMap Dimensionality Reductions or data projections.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
CS 584. Review n Systems of equations and finite element methods are related.
Normalized Cuts and Image Segmentation Jianbo Shi and Jitendra Malik, Presented by: Alireza Tavakkoli.
CS 8751 ML & KDDEvaluating Hypotheses1 Sample error, true error Confidence intervals for observed hypothesis error Estimators Binomial distribution, Normal.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
Fast algorithm for detecting community structure in networks.
Distinguishing Photographic Images and Photorealistic Computer Graphics Using Visual Vocabulary on Local Image Edges Rong Zhang,Rand-Ding Wang, and Tian-Tsong.
Segmentation Graph-Theoretic Clustering.
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Expanders Eliyahu Kiperwasser. What is it? Expanders are graphs with no small cuts. The later gives several unique traits to such graph, such as: – High.
Computer Science 1 Web as a graph Anna Karpovsky.
COVERTNESS CENTRALITY IN NETWORKS Michael Ovelgönne UMIACS University of Maryland 1 Chanhyun Kang, Anshul Sawant Computer Science Dept.
Clustering Unsupervised learning Generating “classes”
Models of Influence in Online Social Networks
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
Lecture7 Topic1: Graph spectral analysis/Graph spectral clustering and its application to metabolic networks Topic 2: Concept of Line Graphs Topic 3: Introduction.
Spectral coordinate of node u is its location in the k -dimensional spectral space: Spectral coordinates: The i ’th component of the spectral coordinate.
Efficient Gathering of Correlated Data in Sensor Networks
1 Graph Embedding (GE) & Marginal Fisher Analysis (MFA) 吳沛勳 劉冠成 韓仁智
Focused Matrix Factorization for Audience Selection in Display Advertising BHARGAV KANAGAL, AMR AHMED, SANDEEP PANDEY, VANJA JOSIFOVSKI, LLUIS GARCIA-PUEYO,
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles 組員:李祥豪 謝紹陽 江建霖.
1 A Graph-Theoretic Approach to Webpage Segmentation Deepayan Chakrabarti Ravi Kumar
G52IIP, School of Computer Science, University of Nottingham 1 Edge Detection and Image Segmentation.
Texture. Texture is an innate property of all surfaces (clouds, trees, bricks, hair etc…). It refers to visual patterns of homogeneity and does not result.
Clustering Spatial Data Using Random Walk David Harel and Yehuda Koren KDD 2001.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Jon Kleinberg (Cornell), Christos.
G52IVG, School of Computer Science, University of Nottingham 1 Edge Detection and Image Segmentation.
Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.
Clusters Recognition from Large Small World Graph Igor Kanovsky, Lilach Prego Emek Yezreel College, Israel University of Haifa, Israel.
Jure Leskovec Kevin J. Lang Anirban Dasgupta Michael W. Mahoney WWW’ 2008 Statistical Properties of Community Structure in Large Social and Information.
Community Discovery in Social Network Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
DM GROUP MEETING PRESENTATION PLAN Eigenvector-based Centrality Measures For Temporal Networks by D Taylor et.al. Uncovering the Small Community.
Matrix Factorization and its applications By Zachary 16 th Nov, 2010.
CS 590 Term Project Epidemic model on Facebook
 In the previews parts we have seen some kind of segmentation method.  In this lecture we will see graph cut, which is a another segmentation method.
Community structure in graphs Santo Fortunato. More links “inside” than “outside” Graphs are “sparse” “Communities”
Network Theory: Community Detection Dr. Henry Hexmoor Department of Computer Science Southern Illinois University Carbondale.
Feature Extraction 主講人:虞台文. Content Principal Component Analysis (PCA) PCA Calculation — for Fewer-Sample Case Factor Analysis Fisher’s Linear Discriminant.
Basic Theory (for curve 01). 1.1 Points and Vectors  Real life methods for constructing curves and surfaces often start with points and vectors, which.
Presented by Alon Levin
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.
Giansalvo EXIN Cirrincione unit #4 Single-layer networks They directly compute linear discriminant functions using the TS without need of determining.
James Hipp Senior, Clemson University.  Graph Representation G = (V, E) V = Set of Vertices E = Set of Edges  Adjacency Matrix  No Self-Inclusion (i.
Alan Mislove Bimal Viswanath Krishna P. Gummadi Peter Druschel.
Spectral Methods for Dimensionality
Segmentation Graph-Theoretic Clustering.
Approximating the Community Structure of the Long Tail
Spectral Clustering Eric Xing Lecture 8, August 13, 2010
3.3 Network-Centric Community Detection
A Block Based MAP Segmentation for Image Compression
Presentation transcript:

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs Zhe Jin

Introduction Given a large phone-call network, how can we find communities of users? Several recent studies have used mobile call graph data to examine and characterize the social interactions of cell phone users The paper is to identify if and to what extent do well-defined social groups of callers exist in such networks

EigenSpokes phenomenon The singular vectors of the Mobile Call graph, when plotted against each other, often have clear separate lines, typically aligned with axes.

Following Questions Cause: What causes these spokes? Ubiquity: Do they occur across varied datasets to be worth studying? Community Extraction: How can we exploit them, to chip off meaningful communities from large graphs?

Related Work Graph partitioning: a popular approach for studying community structure in graphs Spectral clustering, a “cut-based” method for understanding graph structures, which has been successful in machine-learning and image segmentation. Cross-Association: partitions the graph so as to maximize information compression, but is limited to bi-partite structures.

Spokes Singular Value Decomposition (SVD) of an m × n matrix W is a factorization defined as: W = UΣV T, where U and V are m × m and n × n size matrices respectively, and Σ is an m × n diagonal matrix comprised of the singular values. Taking the top K values of Σ yields the best rank-K approximation (w.r.t. the Frobenius norm) to the original matrix.

EigenSpokes EE-plot: the scatter plot of vector Ui and Uj, for any i and j, i.e., they plot one point (Uin, Ujn) for each node n in the graph. EigenSpokes pattern: EE-plots for the Mobile Call graph show clear separate straight lines that are often aligned with axes

EigenSpokes, Connectivity and Communities EE-plots (axis-aligned or not) implies that nodes close to each other on a line have similar scores along two eigenvectors We expect that nodes with similar connectivity will have similar scores along the vectors of U Lemma 1: For any real, symmetric adjacency matrix A, if for any i and j, ∀ k, |h(Ai − Aj ) T, Uki| ≤ ǫ, then ∀ k, |Aik − Ajk| ≤ (ǫ √ N) as well.

Ubiquity of Spokes

Recreating Spokes Want to know exactly which features of graphs and community structure result in spokes using both synthetic and real graphs. Synthetic graphs, in particular, allow us to experiment with various parameters and characteristics, and observe their effect on their EE- plots. It shows that the key factors responsible for these patterns are a large number of well-knit communities embedded in very sparse graphs.

We started with a synthetic random heavy-tailed graph with the same number of nodes and degree distribution as our Mobile Call graph but with no community structure. The EE-plots don’t exhibit any spokes pattern. When we synthetically introduce 40 communities (near-cliques of sizes 31−50, with a probability 0.8 of an intra-community edge) into the above random graph, in Figure 4(b), we observe the emergence of the spokes pattern. When we increase the number of communities to 400, in Figure 4(c), the spoke pattern becomes more clear, and resembles Figure 2.

Results The nodes at the extremities do indeed form the artificially embedded communities. The nature of the communities, including the level of internal connectivity, does not affect the emergence of the spokes pattern as long as such connectivity is significant. Thus we infer that one of the important causes for a spokes pattern is the presence of a large number of tightly knit communities in the graph.

SpokEn: Exploiting EigenSpokes Designing SpokEn: is based on the key property of EigenSpokes, the existence of EigenSpokes indicates the presence of well-knit communities whose nodes have a significant component in that singular vector. A good traversal should select only the nodes which belong to a coherent community. We now discuss where to start the traversal, how to grow the community and finally, when to stop.

Initialization We choose the node with the score of maximum magnitude as the seed for the community. We multiply the given singular vector Ui by −1 if necessary to ensure that the score with the largest magnitude is positive.

Discovery A simple algorithm for discovery is one that picks nodes in decreasing order of their scores. Such an algorithm can pick a node that is disconnected from all the nodes chosen previously. Hence, we propose the following: let C denote the set of all nodes that have been discovered so far; the next node that we select is the node with the largest score that is connected to some node in C. Formally, we augment C with a node n ∗ that satisfies n ∗ = arg maxn ∈ NC Ui(n), where NC is the neighborhood of C 7. This algorithm is intuitive and keeps C always connected.

Termination and Trimming For termination, we need to use a metric that quantifies the quality of the community extracted so far. We propose to use a novel hybrid approach based on conductance [24] for cut and modularity (actually relative modularity) for coherence. The process discovers and adds nodes to the set C as long as the relative modularity increases and terminates once it reduces indicating reduction in community structure. We finally use a conductance based method to trim out the remaining false positives.

Discussion Relative Modularity :In large graphs such as ours, underlying communities are typically small (10 ≈ 100 in a million node-graph). The equation for modularity indicates that when extracting a single small community from a large graph, the modularity metric computed on such a highly unbalanced partition would be dominated by the larger partition and not the discovered community, thus rendering it useless. Trimming using Conductance: conductance as a termination criterion results in premature termination of the discovery process, causing several false negatives while relative modularity as a termination metric often results in overshooting and hence false positives.

Empirical Results conductance as a termination criterion undershot and hence detected fewer communities (about 60% with almost no false positives) modularity discovered about 80% communities but with 4% false positives SpokEn, was able to identify 76-90% of the embedded communities with almost no false positives Speed : The computation time is mainly dominated by the eigenvector calculation which is linear in edges. The processing time is linear in the number of graph edges

Following Questions Cause: What causes these spokes? Ubiquity: Do they occur across varied datasets to be worth studying? Community Extraction: How can we exploit them, to chip off meaningful communities from large graphs?

Conclusions Cause: Spokes can be strongly associated with the presence of well- defined communities like cliques and bi-partite cores in sparse graphs Ubiquity: Apart from Mobile Call graphs, they occur in a variety of datasets such as Patent citations, Dictionary and Internet Community Extraction: The spokes pattern allows us to construct an efficient and scalable algorithm “SpokEn” that helps us chip off communities thereby revealing several interesting structures in Mobile Call graphs as well as the other datasets.

Thank You!!