Bruno Ribeiro Don Towsley University of Massachusetts Amherst IMC 2010 Melbourne, Australia.

Slides:



Advertisements
Similar presentations
Lindsey Bleimes Charlie Garrod Adam Meyerson
Advertisements

AP Statistics Course Review.
Analysis and Modeling of Social Networks Foudalis Ilias.
1 2.5K-Graphs: from Sampling to Generation Minas Gjoka, Maciej Kurant ‡, Athina Markopoulou UC Irvine, ETZH ‡
Practical Recommendations on Crawling Online Social Networks
Information Networks Small World Networks Lecture 5.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
Directional triadic closure and edge deletion mechanism induce asymmetry in directed edge properties.
Networks. Graphs (undirected, unweighted) has a set of vertices V has a set of undirected, unweighted edges E graph G = (V, E), where.
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
1 Walking on a Graph with a Magnifying Glass Stratified Sampling via Weighted Random Walks Maciej Kurant Minas Gjoka, Carter T. Butts, Athina Markopoulou.
Los Angeles September 27, 2006 MOBICOM Localization in Sparse Networks using Sweeps D. K. Goldenberg P. Bihler M. Cao J. Fang B. D. O. Anderson.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
1 On Compressing Web Graphs Michael Mitzenmacher, Harvard Micah Adler, Univ. of Massachusetts.
EXPANDER GRAPHS Properties & Applications. Things to cover ! Definitions Properties Combinatorial, Spectral properties Constructions “Explicit” constructions.
Advanced Topics in Data Mining Special focus: Social Networks.
SDSC, skitter (July 1998) A random graph model for massive graphs William Aiello Fan Chung Graham Lincoln Lu.
Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,
TELCOM2125: Network Science and Analysis
Ramanujan Graphs of Every Degree Adam Marcus (Crisply, Yale) Daniel Spielman (Yale) Nikhil Srivastava (MSR India)
The Very Small World of the Well-connected. (19 june 2008 ) Lada Adamic School of Information University of Michigan Ann Arbor, MI
Konstantin Avrachenkov (INRIA) Prithwish Basu (BBN) Giovanni Neglia (INRIA) Bruno Ribeiro (CMU) Don Towsley (UMass Amherst) 1 K. Avrachenkov, P. Basu,
COVERTNESS CENTRALITY IN NETWORKS Michael Ovelgönne UMIACS University of Maryland 1 Chanhyun Kang, Anshul Sawant Computer Science Dept.
Bump Hunting The objective PRIM algorithm Beam search References: Feelders, A.J. (2002). Rule induction by bump hunting. In J. Meij (Ed.), Dealing with.
Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.
Information Networks Power Laws and Network Models Lecture 3.
A Distributed and Privacy Preserving Algorithm for Identifying Information Hubs in Social Networks M.U. Ilyas, Z Shafiq, Alex Liu, H Radha Michigan State.
Multigraph Sampling of Online Social Networks Minas Gjoka, Carter Butts, Maciej Kurant, Athina Markopoulou 1Multigraph sampling.
1 Link-Trace Sampling for Social Networks: Advances and Applications Maciej Kurant (UC Irvine) Join work with: Minas Gjoka (UC Irvine), Athina Markopoulou.
Neighbourhood Sampling for Local Properties on a Graph Stream A. Pavan, Iowa State University Kanat Tangwongsan, IBM Research Srikanta Tirthapura, Iowa.
Approximating the MST Weight in Sublinear Time Bernard Chazelle (Princeton) Ronitt Rubinfeld (NEC) Luca Trevisan (U.C. Berkeley)
Traceroute-like exploration of unknown networks: a statistical analysis A. Barrat, LPT, Université Paris-Sud, France I. Alvarez-Hamelin (LPT, France) L.
Author: M.E.J. Newman Presenter: Guoliang Liu Date:5/4/2012.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 8-1 Confidence Interval Estimation.
Population All members of a set which have a given characteristic. Population Data Data associated with a certain population. Population Parameter A measure.
MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.
DATA MINING LECTURE 13 Absorbing Random walks Coverage.
WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS junction.
Network Characterization via Random Walks B. Ribeiro, D. Towsley UMass-Amherst.
A Graph-based Friend Recommendation System Using Genetic Algorithm
Understanding Crowds’ Migration on the Web Yong Wang Komal Pal Aleksandar Kuzmanovic Northwestern University
School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2013 Figures are taken.
Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.
1 1 Slide © 2007 Thomson South-Western. All Rights Reserved Chapter 8 Interval Estimation Population Mean:  Known Population Mean:  Known Population.
University “Ss. Cyril and Methodus” SKOPJE Cluster-based MDS Algorithm for Nodes Localization in Wireless Sensor Networks Ass. Biljana Stojkoska.
Introducing Communication Research 2e © 2014 SAGE Publications Chapter Seven Generalizing From Research Results: Inferential Statistics.
Minas Gjoka, Emily Smith, Carter T. Butts
The Structure of Broad Topics on the Web Soumen Chakrabarti Mukul M. Joshi Kunal Punera (IIT Bombay) David M. Pennock (NEC Research Institute)
Graphs A graphs is an abstract representation of a set of objects, called vertices or nodes, where some pairs of the objects are connected by links, called.
Exact Inference in Bayes Nets. Notation U: set of nodes in a graph X i : random variable associated with node i π i : parents of node i Joint probability:
Data Structures and Algorithms in Parallel Computing Lecture 7.
Chapter 5 Sampling Distributions. The Concept of Sampling Distributions Parameter – numerical descriptive measure of a population. It is usually unknown.
Performance Evaluation Lecture 1: Complex Networks Giovanni Neglia INRIA – EPI Maestro 10 December 2012.
Community detection via random walk Draft slides.
NN k Networks for browsing and clustering image collections Daniel Heesch Communications and Signal Processing Group Electrical and Electronic Engineering.
Importance Measures on Nodes Lecture 2 Srinivasan Parthasarathy 1.
Warsaw Summer School 2014, OSU Study Abroad Program Sampling Distribution.
1 Coarse-Grained Topology Estimation via Graph Sampling Maciej Kurant 1 Minas Gjoka 2 Yan Wang 2 Zack W. Almquist 2 Carter T. Butts 2 Athina Markopoulou.
Lecture II Introduction to complex networks Santo Fortunato.
Random Walk for Similarity Testing in Complex Networks
Shan Lu, Jieqi Kang, Weibo Gong, Don Towsley UMASS Amherst
Uncovering the Mystery of Trust in An Online Social Network
Groups of vertices and Core-periphery structure
Approximating the MST Weight in Sublinear Time
Empirical analysis of Chinese airport network as a complex weighted network Methodology Section Presented by Di Li.
Modeling, sampling, generating Networks with MRV
Community detection in graphs
Network Science: A Short Introduction i3 Workshop
Shan Lu, Jieqi Kang, Weibo Gong, Don Towsley UMASS Amherst
Advanced Topics in Data Mining Special focus: Social Networks
Presentation transcript:

Bruno Ribeiro Don Towsley University of Massachusetts Amherst IMC 2010 Melbourne, Australia

2 sample of Twitter network measure characteristics of social networks in the wild

3 The Fra Mauro world map (1459) “World Map” in 1459 ◦ proved incomplete (Columbus et al. 1492) (Australia 17 th century) ◦ wrong proportions (Africa & Asia) source: Wikipedia

 methods to sample graphs (online social networks)  uniform vertex sampling v.s. uniform edge sampling  random walks  our contribution (Frontier sampling random walk)  results 4

5 random sampling (uniform & independent) crawling  vertex sampling  BFS sampling 5  random walk sampling  edge sampling

 uniform vertex sampling ◦ θ i - fraction of vertices with degree i ◦ vertex with degree i is sampled with probability θ i  uniform edge sampling ◦ π i - probability that a vertex with degree i is sampled ◦ π i = θ i x i /  estimating θ i from π i ( uniform edge) : ◦ trivial to remove bias 6 v v u u

estimate: θ i - fraction of vertices with degree i ; budget: B samples  accuracy metric: Normalized root Mean Squared Error  uniform vertex  uniform edge, 7

8 uniform edge uniform vertex  head: GOOD  tail: BAD  head: GOOD  tail: BAD GOOD  head: BAD  tail: GOOD  head: BAD  tail: GOOD BAD  Flickr graph (1.7 M vertices, 22M edges)  sampling budget: B = |V|/100 samples vertex degree avg. degree

uniform vertex  pros: ◦ independent sampling ◦ OSN needs numeric user IDs. E.g.: Livejournal, Flickr, MySpace, Facebook,...  cons: ◦ resource intensive (sparse user ID space) ◦ difficult to sample large degree vertices 9 uniform edge  pros: ◦ independent sampling ◦ easy to sample large degree vertices  cons: ◦ no public OSN interface to sample edges ◦ difficult to sample small degree vertices

 start at v  randomly selects a neighbor of v ... until B samples  vertices can be sampled multiple times  often (resource-wise) cheaper than uniform vertex sampling  graph should be connected  multiple RWs: m independent walkers to capture  B/m  samples 10

 θ i – fraction of vertices with degree i  P[ sampled degree = i]  π i  in steady state samples edges uniformly (only if graph connected)  RW = uniform edge sampling without independence CCDF RW sampling π i θ i 11 (i) distribution observed by RW true distribution P[X > x] x = degree (log-scale)

uniform vertex  pros: ◦ independent sampling ◦ supported by OSNs with numeric user IDs: Livejournal, Flickr, MySpace, Facebook,...  cons: ◦ resource intensive (sparse user ID space) ◦ difficult to sample large degree vertices 12 uniform edge RW  pros: ◦ independent sampling ◦ easy to sample high degree vertices ◦ resource-wise cheap  cons: ◦ graph must be connected ◦ large estimation errors when graph loosely connected ◦ should start in steady state (discard transient samples, but transient is unknown)

 uniform vertex samples both A and B subgraphs ◦ but is expensive  RW samples either A or B ◦ but is cheap 13 A B combine advantages of uniform vertex & RWs?

 design a RW that in steady state ◦ samples edges uniformly (importance sampling) & ◦ initialize steady state w/ uniform vertex sampling  in steady state we want ◦ to sample vertices proportional to degree ◦ to start with uniformly sampled vertices 14

15

B – sampling budget Let S = {v 1, v 2, …, v m } be a set of m vertices (1) select v r  S w.p.  deg( v r ) (2) walk one step from v r (3) add walked edge to E’ and update v r (4) return to (1) (until m + | E’ | = B ) 16

17 G Frontier sampling  random walk on G m G m = m -th Cartesian power of G  u u j j k k u u j j k k = G  u,u j,u u,k u,j k,u k,k j,j = G2G2 k,j j,k

 when in steady state ( m →  ) FS state at step k : S k =( v 1, v 2,..., v  ) FS state at step k+1 : S k+1 =( v 1, u 2,..., v  ) ◦ samples edges uniformly (like a RW) ◦ m →   number of walkers in v  V is uniformly distributed (uniform vertex sampling) 18 uniform vertex distribution v 2, u 2 chosen proportional to their degrees

19  Flickr: 1.7M vertices, 22M edges  Plot evolution (n), where n = number of steps  4 sample paths = 4 curves

 2 Albert-Barabasi graphs (5x10 5 vertices) w/ avg. deg. 2 and 10 connected by 1 edge 20 AB 2 AB 2 AB 10 AB 10 1 edge

 assortativity => measure of degree correlation between neighboring vertices  global clustering coefficient 21 assortativity: FS consistently better global clustering coefficient: all RWs do well assortativity: FS consistently better global clustering coefficient: all RWs do well

 RW cannot start close to steady state  FS steady state close to uniform vertex sampling (m >> 1) 22

23 Estimating and Sampling Graphs with Multidimensional Random Walks

A RW that can jump to a randomly chosen vertex Improving Random Walk Estimation Accuracy with Uniform Restarts, (with Konstantin Avrachenkov (INRIA) and Don Towsley (UMASS) ), WAW 2010  Shorter mixing time  Smaller estimation error  trivial extension: Frontier sampling with random jumps 24

 centrally coordinated but can be easily made distributed at no cost Distributed FS:  independent multiple RWs  virtual budget B for each walker  at step i reduce budget by X  exp(deg(v i )) ( v i - RW vertex at step i )  no communication cost  virtual budget B  # of samples 25

 large connected component of Flickr graph  estimate complimentary cumulative distribution (CCDF)  metric of accuracy: 26 vertex degree