Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,

Slides:



Advertisements
Similar presentations
Introduction to Monte Carlo Markov chain (MCMC) methods
Advertisements

Autonomic Scaling of Cloud Computing Resources
Benchmarking traversal operations over graph databases Marek Ciglan 1, Alex Averbuch 2 and Ladialav Hluchý 1 1 Institute of Informatics, Slovak Academy.
Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review By Mary Kathryn Cowles and Bradley P. Carlin Presented by Yuting Qi 12/01/2006.
Analysis and Modeling of Social Networks Foudalis Ilias.
Modeling Malware Spreading Dynamics Michele Garetto (Politecnico di Torino – Italy) Weibo Gong (University of Massachusetts – Amherst – MA) Don Towsley.
1 2.5K-Graphs: from Sampling to Generation Minas Gjoka, Maciej Kurant ‡, Athina Markopoulou UC Irvine, ETZH ‡
Practical Recommendations on Crawling Online Social Networks
Construction of Simple Graphs with a Target Joint Degree Matrix and Beyond Minas Gjoka, Balint Tillman, Athina Markopoulou University of California, Irvine.
Modeling and Analysis of Random Walk Search Algorithms in P2P Networks Nabhendra Bisnik, Alhussein Abouzeid ECSE, Rensselaer Polytechnic Institute.
Walter Willinger AT&T Research Labs Reza Rejaie, Mojtaba Torkjazi, Masoud Valafar University of Oregon Mauro Maggioni Duke University HotMetrics’09, Seattle.
Xiaowei Ying Xintao Wu Univ. of North Carolina at Charlotte 2009 SIAM Conference on Data Mining, May 1, Sparks, Nevada Graph Generation with Prescribed.
Suggested readings Historical notes Markov chains MCMC details
BAYESIAN INFERENCE Sampling techniques
1 Walking on a Graph with a Magnifying Glass Stratified Sampling via Weighted Random Walks Maciej Kurant Minas Gjoka, Carter T. Butts, Athina Markopoulou.
Random Walk on Graph t=0 Random Walk Start from a given node at time 0
1 Statistical Inference H Plan: –Discuss statistical methods in simulations –Define concepts and terminology –Traditional approaches: u Hypothesis testing.
Masoud Valafar †, Reza Rejaie †, Walter Willinger ‡ † University of Oregon ‡ AT&T Labs-Research WOSN’09 Barcelona, Spain Beyond Friendship Graphs: A Study.
UNDERSTANDING VISIBLE AND LATENT INTERACTIONS IN ONLINE SOCIAL NETWORK Presented by: Nisha Ranga Under guidance of : Prof. Augustin Chaintreau.
Maciej Kurant (EPFL / UCI) Joint work with: Athina Markopoulou (UCI),
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
On Unbiased Sampling for Unstructured Peer-to-Peer Networks Daniel Stutzbach – University of Oregon Reza Rejaie – University of Oregon Nick Duffield –
Structure based Data De-anonymization of Social Networks and Mobility Traces Shouling Ji, Weiqing Li, and Raheem Beyah Georgia Institute of Technology.
RelSamp: Preserving Application Structure in Sampled Flow Measurements Myungjin Lee, Mohammad Hajjat, Ramana Rao Kompella, Sanjay Rao.
SocialFilter: Introducing Social Trust to Collaborative Spam Mitigation Michael Sirivianos Telefonica Research Telefonica Research Joint work with Kyungbaek.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
A Distributed and Privacy Preserving Algorithm for Identifying Information Hubs in Social Networks M.U. Ilyas, Z Shafiq, Alex Liu, H Radha Michigan State.
Multigraph Sampling of Online Social Networks Minas Gjoka, Carter Butts, Maciej Kurant, Athina Markopoulou 1Multigraph sampling.
1 Link-Trace Sampling for Social Networks: Advances and Applications Maciej Kurant (UC Irvine) Join work with: Minas Gjoka (UC Irvine), Athina Markopoulou.
University of California at Santa Barbara Christo Wilson, Bryce Boe, Alessandra Sala, Krishna P. N. Puttaswamy, and Ben Zhao.
WSEAS AIKED, Cambridge, Feature Importance in Bayesian Assessment of Newborn Brain Maturity from EEG Livia Jakaite, Vitaly Schetinin and Carsten.
Modeling Relationship Strength in Online Social Networks Rongjing Xiang: Purdue University Jennifer Neville: Purdue University Monica Rogati: LinkedIn.
Using Transactional Information to Predict Link Strength in Online Social Networks Indika Kahanda and Jennifer Neville Purdue University.
1 Sampling Massive Online Graphs Challenges, Techniques, and Applications to Facebook Maciej Kurant (UC Irvine) Joint work with: Minas Gjoka (UC Irvine),
Data Analysis in YouTube. Introduction Social network + a video sharing media – Potential environment to propagate an influence. Friendship network and.
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Michael Baron + * Department of Computer Science, University of Texas at Dallas + Department of Mathematical.
WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS junction.
Poking Facebook: Characterization of OSN Applications Minas Gjoka, Michael Sirivianos, Athina Markopoulou, Xiaowei Yang University of California, Irvine.
Network Characterization via Random Walks B. Ribeiro, D. Towsley UMass-Amherst.
Tag Ranking Present by Jie Xiao Dept. of Computer Science Univ. of Texas at San Antonio.
1 Passive Network Tomography Using Bayesian Inference Lili Qiu Joint work with Venkata N. Padmanabhan and Helen J. Wang Microsoft Research Internet Measurement.
How far removed are you? Scalable Privacy-Preserving Estimation of Social Path Length with Social PaL Marcin Nagy joint work with Thanh Bui, Emiliano De.
Emergence of Scaling and Assortative Mixing by Altruism Li Ping The Hong Kong PolyU
Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.
Xiaowei Ying, Xintao Wu Univ. of North Carolina at Charlotte PAKDD-09 April 28, Bangkok, Thailand On Link Privacy in Randomizing Social Networks.
1 Presented by: Yuchen Bian MRWC: Clustering based on Multiple Random Walks Chain.
Markov Cluster (MCL) algorithm Stijn van Dongen.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
Bruno Ribeiro Don Towsley University of Massachusetts Amherst IMC 2010 Melbourne, Australia.
Sampling Techniques for Large, Dynamic Graphs Daniel Stutzbach – University of Oregon Reza Rejaie – University of Oregon Nick Duffield – AT&T Labs—Research.
A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi.
Community-enhanced De-anonymization of Online Social Networks Shirin Nilizadeh, Apu Kapadia, Yong-Yeol Ahn Indiana University Bloomington CCS 2014.
Reducing MCMC Computational Cost With a Two Layered Bayesian Approach
Minas Gjoka, Emily Smith, Carter T. Butts
Stefanos Antaris Distributed Publish/Subscribe Notification System for Online Social Networks Stefanos Antaris *, Sarunas Girdzijauskas † George Pallis.
CS Statistical Machine learning Lecture 25 Yuan (Alan) Qi Purdue CS Nov
Fast Parallel Algorithms for Edge-Switching to Achieve a Target Visit Rate in Heterogeneous Graphs Maleq Khan September 9, 2014 Joint work with: Hasanuzzaman.
1 Coarse-Grained Topology Estimation via Graph Sampling Maciej Kurant 1 Minas Gjoka 2 Yan Wang 2 Zack W. Almquist 2 Carter T. Butts 2 Athina Markopoulou.
Institute of Statistics and Decision Sciences In Defense of a Dissertation Submitted for the Degree of Doctor of Philosophy 26 July 2005 Regression Model.
Alan Mislove Bimal Viswanath Krishna P. Gummadi Peter Druschel.
Uncovering the Mystery of Trust in An Online Social Network
Advanced Statistical Computing Fall 2016
Sequential Algorithms for Generating Random Graphs
CPSC 531: System Modeling and Simulation
Statistical Methods Carey Williamson Department of Computer Science
Binghui Wang, Le Zhang, Neil Zhenqiang Gong
The Most General Markov Substitution Model on an Unrooted Tree
Carey Williamson Department of Computer Science University of Calgary
GANG: Detecting Fraudulent Users in OSNs
Random Walk on Graph t=0 Random Walk Start from a given node at time 0
Presentation transcript:

Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts, Athina Markopoulou UC Irvine, EPFL ‡

Minas Gjoka, UC IrvineWalking in Facebook 2 Outline Motivation and Problem Statement Sampling Methodology Data Analysis Conclusion

Minas Gjoka, UC IrvineWalking in Facebook 3 Online Social Networks (OSNs) A network of declared friendships between users Allows users to maintain relationships Many popular OSNs with different focus – Facebook, LinkedIn, Flickr, … Social Graph

Minas Gjoka, UC IrvineWalking in Facebook 4 Why Sample OSNs? Representative samples desirable – study properties – test algorithms Obtaining complete dataset difficult – companies usually unwilling to share data – tremendous overhead to measure all (~100TB for Facebook)

Minas Gjoka, UC IrvineWalking in Facebook 5 Problem statement Obtain a representative sample of users in a given OSN by exploration of the social graph. – in this work we sample Facebook (FB) – explore graph using various crawling techniques

Minas Gjoka, UC IrvineWalking in Facebook 6 Related Work Graph traversal (BFS) – A. Mislove et al, IMC 2007 – Y. Ahn et al, WWW 2007 – C. Wilson, Eurosys 2009 Random walks (MHRW, RDS) – M. Henzinger et al, WWW 2000 – D. Stutbach et al, IMC 2006 – A. Rasti et al, Mini Infocom 2009

Minas Gjoka, UC IrvineWalking in Facebook 7 Our Contributions Compare various crawling techniques in FB’s social graph – Breadth-First-Search (BFS) – Random Walk (RW) – Metropolis-Hastings Random Walk (MHRW) Practical recommendations – online convergence diagnostic tests – proper use of multiple parallel chains – methods that perform better and tradeoffs Uniform sample of Facebook users – collection and analysis – made available to researchers

Minas Gjoka, UC IrvineWalking in Facebook 8 Outline Motivation and Problem Statement Sampling Methodology – crawling methods – data collection – convergence evaluation – method comparisons Data Analysis Conclusion

Minas Gjoka, UC IrvineWalking in Facebook 9 (1) Breadth-First-Search (BFS) Starting from a seed, explores all neighbor nodes. Process continues iteratively without replacement. BFS leads to bias towards high degree nodes Lee et al, “Statistical properties of Sampled Networks”, Phys Review E, 2006 Early measurement studies of OSNs use BFS as primary sampling technique i.e [Mislove et al], [Ahn et al], [Wilson et al.]

Minas Gjoka, UC IrvineWalking in Facebook 10 (2) Random Walk (RW) Explores graph one node at a time with replacement In the stationary distribution Degree of node υ Number of edges

Minas Gjoka, UC IrvineWalking in Facebook 11 (3) Re-Weighted Random Walk (RWRW) Hansen-Hurwitz estimator Corrects for degree bias at the end of collection Without re-weighting, the probability distribution for node property A is: Re-Weighted probability distribution : Subset of sampled nodes with value i All sampled nodes Degree of node u

Minas Gjoka, UC IrvineWalking in Facebook 12 (4) Metropolis-Hastings Random Walk (MHRW) Explore graph one node at a time with replacement In the stationary distribution

Minas Gjoka, UC IrvineWalking in Facebook 13 Uniform userID Sampling (UNI) As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI) – rejection sampling on the 32-bit userID space UNI not a general solution for sampling OSNs – userID space must not be sparse – names instead of numbers

Minas Gjoka, UC IrvineWalking in Facebook 14 Summary of Datasets Sampling methodMHRWRWBFSUNI #Valid Users28x81K 984K # Unique Users957K2.19M2.20M984K Egonets for a subsample of MHRW - local properties of nodes Datasets available at:

Minas Gjoka, UC IrvineWalking in Facebook 15 Data Collection Basic Node Information What information do we collect for each sampled node u ?

Minas Gjoka, UC IrvineWalking in Facebook 16 Detecting Convergence Number of samples (iterations) to loose dependence from starting points?

Minas Gjoka, UC IrvineWalking in Facebook 17 Online Convergence Diagnostics Geweke Detects convergence for a single walk. Let X be a sequence of samples for metric of interest. J. Geweke, “Evaluating the accuracy of sampling based approaches to calculate posterior moments“ in Bayesian Statistics 4, 1992 XaXa XbXb

Minas Gjoka, UC IrvineWalking in Facebook 18 Online Convergence Diagnostics Gelman-Rubin Detects convergence for m>1 walks A. Gelman, D. Rubin, “Inference from iterative simulation using multiple sequences“ in Statistical Science Volume 7, 1992 Walk 1 Walk 2 Walk 3 Between walks variance Within walks variance

Minas Gjoka, UC IrvineWalking in Facebook 19 When do we reach equilibrium? Burn-in determined to be 3K Node Degree

Minas Gjoka, UC IrvineWalking in Facebook 20 Methods Comparison Node Degree Poor performance for BFS, RW MHRW, RWRW produce good estimates – per chain – overall 28 crawls

Minas Gjoka, UC IrvineWalking in Facebook 21 Sampling Bias BFS Low degree nodes under-represented by two orders of magnitude BFS is biased

Minas Gjoka, UC IrvineWalking in Facebook 22 Sampling Bias MHRW, RW, RWRW Degree distribution identical to UNI (MHRW,RWRW) RW as biased as BFS but with smaller variance in each walk

Minas Gjoka, UC IrvineWalking in Facebook 23 Practical Recommendations for Sampling Methods Use MHRW or RWRW. Do not use BFS, RW. Use formal convergence diagnostics – assess convergence online – use multiple parallel walks MHRW vs RWRW – RWRW slightly better performance – MHRW provides a “ready-to-use” sample

Minas Gjoka, UC IrvineWalking in Facebook 24 Outline Motivation and Problem Statement Sampling Methodology Data Analysis Conclusion

Minas Gjoka, UC IrvineWalking in Facebook 25 FB Social Graph Degree Distribution Degree distribution not a power law a 2 =3.38 a 1 =1.32

Minas Gjoka, UC IrvineWalking in Facebook 26 FB Social Graph Topological Characteristics Our MHRW sample – Assortativity Coefficient = – range of Clustering Coefficient [0.05, 0.35] [Wilson et al, Eurosys 09] – Assortativity Coefficient = 0.17 – range of Clustering Coefficient [0.05, 0.18] More details in our paper and technical report

Minas Gjoka, UC IrvineWalking in Facebook 27 Conclusion Compared graph crawling methods – MHRW, RWRW performed remarkably well – BFS, RW lead to substantial bias Practical recommendations – correct for bias – usage of online convergence diagnostics – proper use of multiple chains Datasets publicly available –

Minas Gjoka, UC IrvineWalking in Facebook 28 Thank you Questions?