WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.


Outline
- Motivation and Problem Statement
- Sampling Methodology
- Evaluation of Sampling Techniques
- Facebook Data Analysis
- Conclusion

Online Social Networks (OSNs)
- A network of declared friendships between users, allowing users to maintain relationships
- Many popular OSNs with different focus
  - Facebook, LinkedIn, Flickr, …
- Facebook
  - More than 400 million active users
  - 50% of them log on to Facebook on any given day
  - The average user has 130 friends
  - People spend over 500 billion minutes per month on Facebook
  - More than 100 million mobile users
  - Mobile users are twice as active as non-mobile users
- Social graph: undirected graph G = (V, E)
  - V: nodes (users)
  - E: edges (relationships)
  - k_v: degree of node v

Why Sample OSNs?
- Representative samples are desirable
  - study properties
  - test algorithms
- Obtaining the complete dataset is difficult
  - companies are usually unwilling to share data
  - tremendous overhead to measure everything (~100 TB for Facebook)

Problem Statement
- Obtain a representative sample of users in a given OSN by exploring the social graph
  - goal: a uniform sample of Facebook users
  - explore the graph using various crawling techniques

Outline
- Motivation and Problem Statement
- Sampling Methodology
  - Crawling Methods
  - Convergence Evaluation
  - Data Collection
- Evaluation of Sampling Techniques
- Facebook Data Analysis
- Conclusion

Crawling Methods
- Breadth First Search (BFS)
- Random Walk (RW)
- Re-Weighted Random Walk (RWRW)
- Metropolis-Hastings Random Walk (MHRW)
- Uniform Sampling (UNI)

Breadth First Search (BFS)
- Early measurement studies of OSNs used BFS as the primary sampling technique
- Starting from a seed, BFS explores all neighbor nodes, then their neighbors, and so on
- Because BFS discovers all nodes within some distance of the starting point, an incomplete BFS is likely to densely cover only one specific region of the graph
- BFS is biased towards high-degree nodes
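The incomplete-BFS behaviour described above can be illustrated with a minimal sketch. The function name `bfs_sample`, the `max_nodes` budget, and the toy graph are assumptions made for illustration, not part of the original slides.

```python
from collections import deque


def bfs_sample(neighbors, seed, max_nodes):
    """Return up to max_nodes nodes in BFS discovery order, starting at seed.

    neighbors is a dict mapping each node to a list of adjacent nodes
    (a hypothetical in-memory stand-in for crawling friend lists).
    """
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        v = queue.popleft()
        for w in neighbors[v]:
            if w not in seen:
                seen.add(w)
                visited.append(w)
                queue.append(w)
                if len(visited) == max_nodes:
                    break
    return visited


# Toy graph (illustrative only): hub 0 with leaves 1..3, plus a chain 4-5-6-7.
toy_graph = {
    0: [1, 2, 3, 4],
    1: [0], 2: [0], 3: [0],
    4: [0, 5], 5: [4, 6], 6: [5, 7], 7: [6],
}

# A crawl that stops after 5 nodes only covers the region around the seed.
print(bfs_sample(toy_graph, seed=0, max_nodes=5))
```

With a small budget the sample never reaches the far end of the chain, which is the "densely cover only one region" effect in miniature.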

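Random Walk (RW)
- Explores the graph one node at a time, with replacement
- From the current node v, the next node is chosen uniformly at random among the neighbors of v: $P^{RW}_{v,w} = 1/k_v$ if w is a neighbor of v, and 0 otherwise
- The stationary distribution is biased towards higher-degree nodes ($\pi_v \sim k_v$): $\pi_v = \frac{k_v}{2|E|}$, where $k_v$ is the degree of node v and $|E|$ is the number of edges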
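To make the degree bias concrete, here is a small sketch that runs a simple random walk on a toy graph and compares visit frequencies with k_v / (2|E|). The graph, the function names, and the step count are illustrative assumptions.

```python
import random
from collections import Counter


def random_walk(neighbors, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    v = start
    trace = []
    for _ in range(steps):
        v = rng.choice(neighbors[v])
        trace.append(v)
    return trace


def stationary_distribution(neighbors):
    """Theoretical stationary distribution k_v / (2|E|) of the simple walk."""
    two_e = sum(len(adj) for adj in neighbors.values())  # each edge counted twice
    return {v: len(adj) / two_e for v, adj in neighbors.items()}


# Toy connected graph (illustrative only): a triangle 0-1-2 plus a pendant node 3.
toy_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}

rng = random.Random(0)
trace = random_walk(toy_graph, start=3, steps=100_000, rng=rng)
counts = Counter(trace)
pi = stationary_distribution(toy_graph)
for v in sorted(toy_graph):
    print(v, round(counts[v] / len(trace), 3), pi[v])
```

On this graph node 0 (degree 3) should be visited about three times as often as node 3 (degree 1).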

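Re-Weighted Random Walk (RWRW)
- Corrects for the degree bias at the end of collection
- Without re-weighting, the estimated probability distribution of a node property A (e.g. the degree, network size, ...) is the plain sample frequency: $\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in S} 1}$
- Re-weighted probability distribution: $\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in S} 1/k_u}$, where S is the collection of sampled nodes (with repetitions), $A_i$ holds the samples whose property takes value i, and $k_u$ is the degree of node u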
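The re-weighting step can be sketched as follows: each sample collected by a plain random walk is weighted by the inverse of its degree before frequencies are normalized. The function names and the hand-made sample list are assumptions for illustration; the sample list is constructed so that each node appears exactly in proportion to its degree, as an idealized random walk would produce.

```python
from collections import defaultdict


def plain_estimate(samples, prop):
    """Unweighted frequency of each property value among the samples."""
    counts = defaultdict(float)
    for u in samples:
        counts[prop(u)] += 1.0
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}


def reweighted_estimate(samples, prop, degree):
    """Frequency of each property value with every sample weighted by 1/k_u."""
    weights = defaultdict(float)
    for u in samples:
        weights[prop(u)] += 1.0 / degree(u)
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}


# Hypothetical degrees of a 4-node graph and an idealized RW sample in which
# node v appears k_v times (visit frequency proportional to degree).
toy_degree = {0: 3, 1: 2, 2: 2, 3: 1}
rw_samples = [0, 0, 0, 1, 1, 2, 2, 3]

# Property of interest here: the node degree itself.
print(plain_estimate(rw_samples, toy_degree.get))
print(reweighted_estimate(rw_samples, toy_degree.get, toy_degree.get))
```

The plain estimate over-represents degree 3, while the re-weighted estimate recovers the degree distribution of the four nodes taken uniformly (one node of degree 3, two of degree 2, one of degree 1).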

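Metropolis-Hastings Random Walk (MHRW)
- Explores the graph one node at a time, with replacement
- From the current node v, a neighbor w is proposed uniformly at random and accepted with probability $\min(1, k_v/k_w)$; otherwise the walk stays at v
- The stationary distribution is exactly the uniform distribution: $\pi_v = \frac{1}{|V|}$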
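A minimal sketch of a Metropolis-Hastings walk targeting the uniform distribution is shown below, using the standard acceptance rule min(1, k_v / k_w) for a proposed move from v to w. The toy graph, names, and step count are illustrative assumptions rather than the authors' crawler.

```python
import random
from collections import Counter


def mhrw(neighbors, start, steps, rng):
    """Metropolis-Hastings random walk with a uniform target distribution."""
    v = start
    trace = []
    for _ in range(steps):
        w = rng.choice(neighbors[v])  # propose a uniformly chosen neighbor
        # Accept with probability min(1, k_v / k_w); otherwise stay at v.
        if rng.random() < len(neighbors[v]) / len(neighbors[w]):
            v = w
        trace.append(v)  # a rejected proposal re-samples the current node
    return trace


# Toy connected graph (illustrative only): a triangle 0-1-2 plus a pendant node 3.
toy_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}

rng = random.Random(0)
trace = mhrw(toy_graph, start=0, steps=100_000, rng=rng)
counts = Counter(trace)
for v in sorted(toy_graph):
    print(v, round(counts[v] / len(trace), 3))
```

All four nodes should end up with a visit frequency near 0.25, regardless of degree. Note that the walk often stays put at the low-degree node 3, since moves from it to the hub are accepted only one time in three.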

Uniform Sampling (UNI)
- Serves as a basis for comparison (ground truth)
- Rejection sampling
  - sample uniformly from the 32-bit userID space
  - discard the non-existing IDs
  - yields a uniform sample of the existing user IDs in Facebook for any allocation policy (i.e. even if the userIDs are not evenly allocated in the 32-bit address space)
- UNI is not a general solution for sampling OSNs
  - the userID space must not be sparse
  - does not work when users are identified by names instead of numbers
  - must be supported by the system
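The rejection-sampling idea can be sketched as below. The `user_exists` callback is a hypothetical stand-in for whatever lookup the OSN offers; it is not a real Facebook API, and the small ID space and allocation pattern are made up for the example.

```python
import random


def uniform_user_sample(user_exists, id_space_size, n_samples, rng):
    """Draw n_samples existing IDs uniformly (with replacement) by rejection.

    Candidates are drawn uniformly from [1, id_space_size]; IDs for which
    user_exists(candidate) is False are discarded.
    Returns the accepted IDs and the total number of candidates tried.
    """
    samples = []
    attempts = 0
    while len(samples) < n_samples:
        candidate = rng.randrange(1, id_space_size + 1)
        attempts += 1
        if user_exists(candidate):
            samples.append(candidate)
    return samples, attempts


# Hypothetical, deliberately uneven allocation: a dense block of low IDs
# plus a few scattered high IDs in a space of 1000 possible IDs.
existing_ids = set(range(1, 51)) | set(range(900, 1001, 10))

rng = random.Random(1)
samples, attempts = uniform_user_sample(
    lambda i: i in existing_ids, id_space_size=1000, n_samples=200, rng=rng
)
print(len(samples), attempts)
```

Every existing ID is equally likely to be accepted no matter how the IDs are laid out, but the number of wasted attempts grows as the ID space gets sparser, which is why the slide requires a non-sparse userID space.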

Convergence Detection
- How many samples (iterations) does a walk need to lose its dependence on the starting point?

Convergence Evaluation
- Using multiple parallel walks to improve convergence
  - avoids getting trapped in a certain region
  - start from 28 different randomly chosen initial nodes
- Detecting convergence with online diagnostics
  - sample longer and discard a number of initial "burn-in" iterations
  - both cost consumed bandwidth (TB) and measurement time (days), so it is crucial to choose an appropriate burn-in and total running time
  - Geweke Diagnostic
  - Gelman-Rubin Diagnostic

Geweke Diagnostic
- Detects the convergence of a single Markov chain
- Compares two segments of the chain, $X_a$ (early) and $X_b$ (late)
- With an increasing number of iterations, $X_a$ and $X_b$ move further apart, which limits the correlation between them
- According to the law of large numbers, the z values become normally distributed, $z \sim N(0, 1)$
- Declare convergence when most values fall in the [-1, 1] interval
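A simplified sketch of the Geweke z-score follows. Two choices here are assumptions not stated on the slide: the windows (first 10% and last 50% of the chain, the conventional defaults) and the variance of each window mean, which is computed naively as sample variance divided by window length. The full diagnostic usually estimates that variance in a way that accounts for autocorrelation.

```python
import math
import statistics


def geweke_z(chain, first=0.1, last=0.5):
    """Naive Geweke z-score comparing an early and a late window of a chain.

    z = (mean(X_a) - mean(X_b)) / sqrt(var(mean X_a) + var(mean X_b)),
    with var(mean) approximated as sample variance / window length
    (this ignores autocorrelation within the chain).
    """
    n = len(chain)
    xa = chain[: int(first * n)]
    xb = chain[n - int(last * n):]
    var_a = statistics.variance(xa) / len(xa)
    var_b = statistics.variance(xb) / len(xb)
    return (statistics.fmean(xa) - statistics.fmean(xb)) / math.sqrt(var_a + var_b)


# A chain whose early and late windows have the same mean: z is zero.
stable_chain = [0, 1] * 500
# A chain that keeps drifting upward: z is far outside [-1, 1].
drifting_chain = list(range(1000))

print(geweke_z(stable_chain))
print(geweke_z(drifting_chain))
```

In the setting of the slides, one such z value would be computed per walk, and convergence declared once most of them fall inside [-1, 1].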

Gelman-Rubin Diagnostic
- Detects convergence for m > 1 walks (m: number of chains)
- Compares the empirical distributions of the individual chains with the empirical distribution of all sequences together, i.e. the between-walks variance against the within-walks variance
- If they are similar enough (R < 1.02), declare convergence
[Figure: Walk 1, Walk 2, Walk 3; between-walks variance and within-walks variance]
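The comparison of between-walks and within-walks variance can be sketched with the textbook form of the potential scale reduction factor. The slide does not give the exact formula used, so this variant, the function name, and the toy chains are assumptions.

```python
import math
import statistics


def gelman_rubin(chains):
    """Textbook potential scale reduction factor R for m equal-length chains.

    W is the mean within-chain variance and B is n times the variance of
    the chain means; R = sqrt(((n - 1) / n * W + B / n) / W).
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Three walks sampling the same values: R is close to 1.
mixed_chains = [[0, 1] * 50, [1, 0] * 50, [0, 1] * 50]
# Three walks stuck in different regions: R is far above the 1.02 threshold.
stuck_chains = [[0, 1] * 50, [10, 11] * 50, [20, 21] * 50]

print(gelman_rubin(mixed_chains))
print(gelman_rubin(stuck_chains))
```

Because R only falls towards 1 when the chain means agree, it complements the single-chain Geweke test by catching walks that have each settled in a different part of the graph.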

Data Collection
- Information collected

Data Collection
- Summary of data set
  - Crawls: 28 x 81K = 2.26M
    - 28 initial starting nodes
    - crawl until exactly 81K samples are collected
  - Walks may repeat the same node
    - number of rejected nodes, without repetition: 645K
  - Uniform sampling: 18.53M nodes picked uniformly from [1, 2^32]
    - only 1216K users existed
    - 228K users had zero friends
  - RW: 97% of nodes are unique; BFS: 97% of nodes are unique
    - confirms that the random seeding chose different areas of FB
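As a quick worked check of the uniform-sampling figures above, the acceptance rate of the rejection sampler and the share of zero-friend users follow directly from the reported counts. The variable names are illustrative; the numbers are the ones on the slide.

```python
# Counts reported on the slide for uniform (rejection) sampling.
ids_queried = 18.53e6       # IDs picked uniformly from [1, 2^32]
users_found = 1216e3        # IDs that corresponded to existing users
zero_friend_users = 228e3   # existing users with no friends

acceptance_rate = users_found / ids_queried        # fraction of queries that hit a user
queries_per_user = ids_queried / users_found       # average queries per accepted user
zero_friend_share = zero_friend_users / users_found

print(f"acceptance rate: {acceptance_rate:.2%}")
print(f"queries per accepted user: {queries_per_user:.2f}")
print(f"users with zero friends: {zero_friend_share:.2%}")
```

Roughly one query in fifteen finds an existing user, which gives a sense of the overhead UNI pays even when the ID space is dense enough to be usable.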

Outline
- Motivation and Problem Statement
- Sampling Methodology
- Evaluation of Sampling Techniques
  - Convergence Analysis
  - Methods Comparison
  - Unbiased Estimation
- Facebook Data Analysis
- Conclusion

Convergence Analysis
- What is a fair way to compare the results of MHRW with RW and BFS?
- MHRW visits fewer unique nodes than RW and BFS
  - MHRW stays at some nodes for a relatively long time (many iterations)
  - this usually happens at low-degree nodes
- An appropriate practical comparison should be based on the number of unique nodes visited

Node Degree Convergence Test
- When does the walk reach equilibrium?
- Geweke: converged when all 28 z values fall in the [-1, 1] interval (500 iterations)
- Gelman-Rubin: converged when all R scores drop below 1.02 (3000 iterations)
  - (0, 1): not in / in
- Burn-in determined to be 3K -> discard 6K

Methods Comparison
- MHRW and RWRW produce good estimates of the probability of a node degree
- Their degree distribution converges fast to that of a good uniform sample
- Poor performance for BFS and RW
- Based on 28 crawls

Unbiased Estimation (BFS, RW)
- Node degree distribution
  - BFS and RW introduce a strong bias towards high-degree nodes
  - low-degree nodes are under-represented

Unbiased Estimation (MHRW)
- Degree distribution identical to UNI (MHRW, RWRW)

Outline
- Motivation and Problem Statement
- Sampling Methodology
- Evaluation of Sampling Techniques
- Facebook Data Analysis
- Conclusion

FB Social Graph – degree distribution
- Degree distribution is not a power law
[Figure labels: a_1 = 1.32, a_2 = 3.38]

FB Social Graph – Assortativity
- Assortativity: do nodes tend to connect to similar or different nodes?
- Positive correlation: high-degree nodes tend to connect to other high-degree nodes
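One common way to quantify this tendency is the degree assortativity coefficient, i.e. the Pearson correlation between the degrees at the two ends of each edge. The slide does not say which measure was used, so the sketch below is an assumption, and the toy graphs are made up for illustration.

```python
import statistics


def degree_assortativity(neighbors):
    """Pearson correlation of the degrees at the two ends of every edge.

    Each undirected edge is counted in both directions, so the two degree
    sequences share the same mean and variance. Undefined (division by
    zero) when all nodes have the same degree.
    """
    xs, ys = [], []
    for v, adj in neighbors.items():
        for w in adj:
            xs.append(len(neighbors[v]))
            ys.append(len(neighbors[w]))
    mean = statistics.fmean(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return cov / var


# Star graph: the hub only connects to leaves, so degrees are anti-correlated.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
# A triangle next to a 4-clique: every edge joins nodes of equal degree.
cliques = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5, 6], 4: [3, 5, 6], 5: [3, 4, 6], 6: [3, 4, 5],
}

print(degree_assortativity(star))
print(degree_assortativity(cliques))
```

Values near +1 mean high-degree nodes attach to other high-degree nodes, values near -1 mean hubs attach mostly to low-degree nodes.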

FB Social Graph – Privacy Awareness

Outline
- Motivation and Problem Statement
- Sampling Methodology
- Evaluation of Sampling Techniques
- Facebook Data Analysis
- Conclusion

Conclusion
- Compared graph crawling methods
  - MHRW and RWRW performed remarkably well
  - BFS and RW lead to substantial bias
- Practical recommendations
  - correct for bias
  - use online convergence diagnostics
  - use multiple chains properly