Community Structure in Large Social and Information Networks Michael W. Mahoney (Joint work at Yahoo with Kevin Lang and Anirban Dasgupta, and also Jure.

Slides:



Advertisements
Similar presentations
1 Dynamics of Real-world Networks Jure Leskovec Machine Learning Department Carnegie Mellon University
Advertisements

Modeling Blog Dynamics Speaker: Michaela Götz Joint work with: Jure Leskovec, Mary McGlohon, Christos Faloutsos Cornell University Carnegie Mellon University.
Jure Leskovec (Stanford) Kevin Lang (Yahoo! Research) Michael Mahoney (Stanford)
Implementing regularization implicitly via approximate eigenvector computation Michael W. Mahoney Stanford University (Joint work with Lorenzo Orecchia.
Analysis and Modeling of Social Networks Foudalis Ilias.
Geometric Network Analysis Tools Michael W. Mahoney Stanford University MMDS, June 2010 ( For more info, see: cs.stanford.edu/people/mmahoney/
Community Detection Laks V.S. Lakshmanan (based on Girvan & Newman. Finding and evaluating community structure in networks. Physical Review E 69,
Jure Leskovec, CMU Kevin Lang, Anirban Dasgupta and Michael Mahoney Yahoo! Research.
Community Detection and Evaluation
1 Connectivity Structure of Bipartite Graphs via the KNC-Plot Erik Vee joint work with Ravi Kumar, Andrew Tomkins.
Lecture 21 Network evolution Slides are modified from Jurij Leskovec, Jon Kleinberg and Christos Faloutsos.
Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.
What did we see in the last lecture?. What are we going to talk about today? Generative models for graphs with power-law degree distribution Generative.
SILVIO LATTANZI, D. SIVAKUMAR Affiliation Networks Presented By: Aditi Bhatnagar Under the guidance of: Augustin Chaintreau.
Jure Leskovec, CMU Kevin Lang, Anirban Dasgupta, Michael Mahoney Yahoo! Research.
Looking for clusters in your data ... (in theory and in practice)
V4 Matrix algorithms and graph partitioning
CS 599: Social Media Analysis University of Southern California1 The Basics of Network Analysis Kristina Lerman University of Southern California.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
CS Lecture 6 Generative Graph Models Part II.
Building Low-Diameter P2P Networks Eli Upfal Department of Computer Science Brown University Joint work with Gopal Pandurangan and Prabhakar Raghavan.
Community Structure in Large Social and Information Networks Michael W. Mahoney Stanford University (Joint work with Kevin Lang and Anirban Dasgupta of.
A scalable multilevel algorithm for community structure detection
Approximate computation and implicit regularization in large-scale data analysis Michael W. Mahoney Stanford University Sept 2011 (For more info, see:
The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.
Community Structure in Large Social and Information Networks Michael W. Mahoney Yahoo Research ( For more info, see:
Approximation Algorithms as “Experimental Probes” of Informatics Graphs Michael W. Mahoney Stanford University ( For more info, see: cs.stanford.edu/people/mmahoney/
Network analysis and applications Sushmita Roy BMI/CS 576 Dec 2 nd, 2014.
Extracting insight from large networks: implications of small-scale and large-scale structure Michael W. Mahoney Stanford University ( For more info, see:
Geometric Tools for Identifying Structure in Large Social and Information Networks Michael W. Mahoney Stanford University SAMSI - Aug 2010 ( For more info,
The Relative Vertex-to-Vertex Clustering Value 1 A New Criterion for the Fast Detection of Functional Modules in Protein Interaction Networks Zina Mohamed.
Topic 13 Network Models Credits: C. Faloutsos and J. Leskovec Tutorial
Survey on Evolving Graphs Research Speaker: Chenghui Ren Supervisors: Prof. Ben Kao, Prof. David Cheung 1.
Sensors, networks, and massive data Michael W. Mahoney Stanford University May 2012 ( For more info, see: cs.stanford.edu/people/mmahoney/ or Google.
Surface Simplification Using Quadric Error Metrics Michael Garland Paul S. Heckbert.
Community Detection by Modularity Optimization Jooyoung Lee
Weighted Graphs and Disconnected Components Patterns and a Generator IDB Lab 현근수 In KDD 08. Mary McGlohon, Leman Akoglu, Christos Faloutsos.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Eric Horvitz, Michael Mahoney,
Web Science Course Lecture: Social Networks - * Dr. Stefan Siersdorfer 1 * Figures from Easley and Kleinberg 2010 (
CS774. Markov Random Field : Theory and Application Lecture 13 Kyomin Jung KAIST Oct
Jure Leskovec PhD: Machine Learning Department, CMU Now: Computer Science Department, Stanford University.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
1 Burning a graph as a model of social contagion Anthony Bonato Ryerson University Institute of Software Chinese Academy of Sciences.
Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Jon Kleinberg (Cornell), Christos.
On-line Social Networks - Anthony Bonato 1 Dynamic Models of On-Line Social Networks Anthony Bonato Ryerson University WAW’2009 February 13, 2009 nt.
Complex Network Theory – An Introduction Niloy Ganguly.
The Structure of the Web. Getting to knowing the Web How big is the web and how do you measure it? How many people use the web? How many use search engines?
Complex Network Theory – An Introduction Niloy Ganguly.
Jure Leskovec Kevin J. Lang Anirban Dasgupta Michael W. Mahoney WWW’ 2008 Statistical Properties of Community Structure in Large Social and Information.
University at BuffaloThe State University of New York Detecting Community Structure in Networks.
RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School.
Community Discovery in Social Network Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
Miniconference on the Mathematics of Computation
Community structure in graphs Santo Fortunato. More links “inside” than “outside” Graphs are “sparse” “Communities”
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
Topics In Social Computing (67810) Module 1 Introduction & The Structure of Social Networks.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Groups of vertices and Core-periphery structure
Community detection in graphs
Lecture 13 Network evolution
Community Detection: Overlapping Communities
Statistical properties of network community structure
Lecture 21 Network evolution
Modelling and Searching Networks Lecture 2 – Complex Networks
Affiliation Network Models of Clusters in Networks
What did we see in the last lecture?
Presentation transcript:

Community Structure in Large Social and Information Networks Michael W. Mahoney (Joint work at Yahoo with Kevin Lang and Anirban Dasgupta, and also Jure Leskovec of CMU.) (For more info, see: Workshop on Algorithms for Modern Massive Data Sets - MMDS 2008

Networks and networked data Interaction graph model of networks: Nodes represent “entities” Edges represent “interaction” between pairs of entities Lots of “networked” data!! technological networks – AS, power-grid, road networks biological networks – food-web, protein networks social networks – collaboration networks, friendships information networks – co-citation, blog cross-postings, advertiser-bidded phrase graphs... language networks – semantic networks......

Sponsored (“paid”) Search Text-based ads driven by user query

Sponsored Search Problems Keyword-advertiser graph: – provide new ads – maximize CTR, RPS, advertiser ROI “Community-related” problems: Marketplace depth broadening: find new advertisers for a particular query/submarket Query recommender system: suggest to advertisers new queries that have high probability of clicks Contextual query broadening: broaden the user's query using other context information

Micro-markets in sponsored search 10 million keywords 1.4 Million Advertisers Gambling Sports Sports Gambling Movies Media Sport videos What is the CTR and advertiser ROI of sports gambling keywords? Goal: Find isolated markets/clusters with sufficient money/clicks with sufficient coherence. Ques: Is this even possible?

Clustering and Community Finding Linear (Low-rank) methods If Gaussian, then low-rank space is good. Kernel (non-linear) methods If low-dimensional manifold, then kernels are good Hierarchical methods Top-down and botton-up -- common in the social sciences Graph partitioning methods Define “edge counting” metric -- conductance, expansion, modularity, etc. -- in interaction graph, then optimize! “It is a matter of common experience that communities exist in networks... Although not precisely defined, communities are usually thought of as sets of nodes with better connections amongst its members than with the rest of the world.”

Community Score: Conductance S S’ 7 How community like is a set of nodes? Need a natural intuitive measure: Conductance (normalized cut)  (S) = # edges cut / # edges inside  Small  (S) corresponds to more community-like sets of nodes

Community Score: Conductance Score:  (S) = # edges cut / # edges inside What is “best” community of 5 nodes? 8

Community Score: Conductance Score:  (S) = # edges cut / # edges inside Bad community  =5/6 = 0.83 What is “best” community of 5 nodes? 9

Community Score: Conductance Score:  (S) = # edges cut / # edges inside Better community  =5/6 = 0.83 Bad community  =2/5 = 0.4 What is “best” community of 5 nodes? 10

Community Score: Conductance Score:  (S) = # edges cut / # edges inside Better community  =5/6 = 0.83 Bad community  =2/5 = 0.4 Best community  =2/8 = 0.25 What is “best” community of 5 nodes? 11

Network Community Profile Plot We define: Network community profile (NCP) plot Plot the score of best community of size k Search over all subsets of size k and find best:  (k=5) = 0.25 NCP plot is intractable to compute Use approximation algorithms 12

Widely-studied small social networks Zachary’s karate club Newman’s Network Science

“Low-dimensional” graphs (and expanders) d-dimensional meshes RoadNet-CA

What do large networks look like? Downward sloping NCPP small social networks (validation) “low-dimensional” networks (intuition) hierarchical networks (model building) Natural interpretation in terms of isoperimetry implicit in modeling with low-dimensional spaces, manifolds, k-means, etc. Large social/information networks are very very different We examined more than 70 large social and information networks We developed principled methods to interrogate large networks Previous community work: on small social networks (hundreds, thousands)

Large Social and Information Networks

Probing Large Networks with Approximation Algorithms Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure. Spectral - (quadratic approx) - confuses “long paths” with “deep cuts” Multi-commodity flow - (log(n) approx) - difficulty with expanders SDP - (sqrt(log(n)) approx) - best in theory Metis - (multi-resolution for mesh-like graphs) - common in practice X+MQI - post-processing step on, e.g., Spectral of Metis Metis+MQI - best conductance (empirically) Local Spectral - connected and tighter sets (empirically, regularized communities!) We are not interested in partitions per se, but in probing network structure.

Typical example of our findings General relativity collaboration network (4,158 nodes, 13,422 edges) 18 Community size Community score

Large Social and Information Networks LiveJournal Epinions Focus on the red curves (local spectral algorithm) - blue (Metis+Flow), green (Bag of whiskers), and black (randomly rewired network) for consistency and cross-validation.

More large networks Cit-Hep-Th Web-Google AtP-DBLP Gnutella

NCPP: LiveJournal (N=5M, E=43M) Community score Community size Better and better communities Best communities get worse and worse Best community has ≈100 nodes 21

“Whiskers” and the “core” “Whiskers” maximal sub-graph detached from network by removing a single edge contains 40% of nodes and 20% of edges “Core” the rest of the graph, i.e., the 2-edge-connected core Global minimum of NCPP is a whisker NCP plot Largest whisker Slope upward as cut into core

What if the “whiskers” are removed? LiveJournal Epinions Then the lowest conductance sets - the “best” communities - are “2-whiskers.” (So, the “core” peels apart like an onion.)

NCPP for common generative models Preferential Attachment Copying Model RB Hierarchical Geometric PA

A simple theorem on random graphs Power-law random graph with   (2,3). Structure of the G(w) model, with   (2,3). Sparsity (coupled with randomness) is the issue, not heavy-tails. (Power laws with   (2,3) give us the appropriate sparsity.)

A “forest fire” model At each time step, iteratively add edges with a “forest fire” burning mechanism. Model of: Leskovec, Kleinberg, and Faloutsos 2005 Also get “densification” and “shrinking diameters” of real graphs with these parameters (Leskovec et al. 05).

Comparison with “Ground truth” (1 of 2) Networks with “ground truth” communities: LiveJournal12 : users create and explicitly join on-line groups CA-DBLP : publication venues can be viewed as communities AmazonAllProd : each item belongs to one or more hierarchically organized categories, as defined by Amazon AtM-IMDB : countries of production and languages may be viewed as communities (thus every movie belongs to exactly one community and actors belongs to all communities to which movies in which they appeared belong)

Comparison with “Ground truth” (2 of 2) LiveJournal CA-DBLP AmazonAllProd AtM-IMDB

Miscellaneous thoughts... Sociological work on community size (Dunbar and Allen) 150 individuals is maximum community size Military companies, on-line communities, divisions of corporations all ≤ 150 Common bond vs. common identity theory Common bond - people are attached to individual community members Common identity - people are attached to the group as a whole What edges “mean” and community identification social networks - reasons an individual adds a link to a friend very diverse citation networks - links are more “expensive” and semantically uniform.

Conclusions Approximation algorithms as experimental probes! Hard-to-cut onion-like core with more structure than random Small well-isolated communities gradually blend into the core Community structure in large networks is qualitatively different! Agree with previous results on small networks Agree with sociological interpretation (Dunbar’s 150 and bond vs. identity)! Common generative models don’t capture community phenomenon! Graph locality - important for realistic network generation Local regularization - important due to sparsity