WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction

Outline  Motivation and Problem Statement  Sampling Methodology  Evaluation of Sampling Techniques  Facebook Data Analysis  Conclusion

Online Social Networks (OSNs)  A network of declared friendships between users, allowing users to maintain relationships  Many popular OSNs with different focus  Facebook, LinkedIn, Flickr, …  Facebook  More than 400 million active users  50% of them log on to Facebook in any given day  Average user has 130 friends  People spend over 500 billion minutes per month on Facebook  more than 100 million mobile users  Mobile user are twice more active than non-mobile users. Social Graph undirected graph G = (V, E) V: nodes (users) E: edges (relationships) k v : node degree

Why Sample OSNs?  Representative samples desirable  study properties  test algorithms  Obtaining complete dataset difficult  companies usually unwilling to share data  tremendous overhead to measure all (~100TB for Facebook)

Problem Statement  Obtain a representative sample of users in a given OSN by exploration of the social graph.  Uniform sample of Facebook users  explore graph using various crawling techniques

Outline  Motivation and Problem Statement  Sampling Methodology  Crawling Methods  Convergence Evaluation  Data Collection  Evaluation of Sampling Techniques  Facebook Data Analysis  Conclusion

Crawling Methods  Crawling Methods  Breadth First Search (BFS)  Random Walk (RW)  Re-Weighted Random Walk (RWRW)  Metropolis-Hastings Random Walk (MHRW)  Uniform Sampling (UNI)

Breadth First Search (BFS)  Early measurement studies of OSNs use BFS as primary sampling technique  Starting from a seed, explores all neighbor nodes.  As this method discovers all nodes within some distance from the starting point, an incomplete BFS is likely to densely cover only some specific region of the graph.  BFS leads to bias towards high degree nodes

Random Walk (RW)  Explores graph one node at a time with replacement  In the stationary distribution  biased towards higher degree nodes ( π v ~ k v ) Degree of node υ Number of edges

Re-Weighted Random Walk (RWRW)  Corrects for degree bias at the end of collection  Without re-weighting, the probability distribution for node property A is: (e.g. the degree, network size...)  Re-Weighted probability distribution : Degree of node u

Metropolis-Hastings Random Walk (MHRW)  Explore graph one node at a time with replacement  In the stationary distribution  Exactly the uniform distribution

Uniform Sampling (UNI)  As a basis for comparison (ground truth)  Rejection sampling  uniform sampling of on the 32-bit IDs  discarding the non-existing ones  yields a uniform sample of the existing user IDs in Facebook for any allocation policy (i.e. even if the userIDs are not evenly allocated in the 32-bit address space)  UNI not a general solution for sampling OSNs  userID space must not be sparse  names instead of numbers  must be supported by the systems

Convergence Detection  Number of samples (iterations) to loose dependence from starting points?

Convergence Evaluation  Using Multiple Parallel Walks to improve convergence  avoid getting trapped in certain region  starting from 28 different randomly chosen initial nodes  Detecting Convergence with Online Diagnostics  sampling longer and discard a number of initial “burn-in” iterations Consumed BW (TB) and measurement time (days) Crucial to decide appropriate ‘burn-in’ and total running time  Grweke Diagnostic  Gelman-Rubin Diagnostic

Geweke Diagnostic  Detect the convergence of a single Markov chain  With increasing number of iterations, X a and X b move further apart, which limits the correlation between them.  according to the law of large numbers, the z values become normally distributed ~ (0, 1)  Declare convergence when most values fall in the [-1,1] interval XaXa XbXb

Walk 1 Walk 2 Walk 3 Between walks variance Within walks variance Gelman-Rubin Diagnostic  Detects convergence for m>1 walks (m: # of chains)  Compare the empirical distributions of individual chains with the empirical distribution of all sequences together  if they are similar enough (R,1.02), declare convergence

Data Collection  Information collected

Data Collection  Summary of data set 28 x 81K = 2.26 M 28 initial starting nodes crawl until exactly 81K samples are collected 28 x 81K = 2.26 M 28 initial starting nodes crawl until exactly 81K samples are collected repeat the same node in a walk # of rejected nodes without repetition : 645 K repeat the same node in a walk # of rejected nodes without repetition : 645 K 18.53M nodes picked uniform from [1, 2 32 ] only 1216 K users existed 228 K users had zero friends 18.53M nodes picked uniform from [1, 2 32 ] only 1216 K users existed 228 K users had zero friends RW: 97 % nodes are unique BFS: 97 % nodes are unique confirms that the random seeding chose different areas of FB RW: 97 % nodes are unique BFS: 97 % nodes are unique confirms that the random seeding chose different areas of FB

Outline  Motivation and Problem Statement  Sampling Methodology  Evaluation of Sampling Techniques  Convergence Analysis  Methods Comparison  Unbiased Estimation  Facebook Data Anaylsis  Conclusion

 What is a fair way to compare the results of MHRW with RW and BFS?  MHRW visits fewer unique nodes than RW and BFS MHRW stays at some nodes for relatively long time/iterations Happens usually at some low degree node  An appropriate practical comparison should be based on the number of visited unique nodes Convergence Analysis

Node Degree Convergence Test  When does it reach equilibrium?  Burn-in determined to be 3K -> discard 6K converge when all 28 values fall in the [-1, 1] interval 500 iterations converge when all R scores drop below 1.02 (0,1): not in / in 3000 iterations

Methods Comparison  MHRW, RWRW produce good in estimating the probability of a node degree  The degree distribution will converge fast to a good uniform sample  Poor performance for BFS, RW 28 crawls

Unbiased Estimation (BFS, RW)  Node degree distribution  introduce a strong bias towards the high degree nodes  the low-degree nodes are under-represented

Unbiased Estimation (MHRW)  Degree distribution identical to UNI (MHRW, RWRW)

FB Social Graph – degree distribution  Degree distribution not a power law a 2 =3.38 a 1 =1.32

FB Social Graph - Assortativity  Assortativity  nodes tend to connect to similar or different nodes?  positive correlation: high degree nodes tend to connect to other high degree nodes

FB Social Graph – Privacy Awareness

Conclusion  Compared graph crawling methods  MHRW, RWRW performed remarkably well  BFS, RW lead to substantial bias  Practical recommendations  correct for bias  usage of online convergence diagnostics  proper use of multiple chains

WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Similar presentations

Presentation on theme: "WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Similar presentations

Presentation on theme: "WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction."— Presentation transcript:

Similar presentations

About project

Feedback