Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks. Maciej Kurant, Minas Gjoka, Carter T. Butts, Athina Markopoulou.

Presentation transcript:

Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks
Maciej Kurant, Minas Gjoka, Carter T. Butts, Athina Markopoulou
University of California, Irvine
SIGMETRICS 2011, June 11th, San Jose

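Online Social Networks (OSNs)
> 1 billion users (over 15% of world’s population, and over 50% of world’s Internet users!)
[Table: the largest OSNs by size (number of users) and traffic rank, dated October. The network names and several values were lost in extraction; surviving fragments: 200 million / 9, 130 million, 43, 75 million / 10, 75 million / 29.]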

Facebook:
500+M users
130 friends each (on average)
8 bytes (64 bits) per user ID
The raw connectivity data, with no attributes: 500 M x 130 x 8 B = 520 GB
To get this data, one would have to download 100+ TB of (uncompressed) HTML data!
This is neither feasible nor practical. Solution: Sampling!
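The size estimate above can be checked with a few lines of arithmetic. This is only a restatement of the slide's numbers (decimal gigabytes are assumed, and the variable names are illustrative):

```python
# Back-of-the-envelope size of the raw friendship data, using the slide's figures.
users = 500_000_000        # 500+ M users
avg_friends = 130          # average number of friends per user
bytes_per_id = 8           # 64-bit user ID

raw_bytes = users * avg_friends * bytes_per_id
raw_gb = raw_bytes / 10**9  # decimal gigabytes (assumption: 1 GB = 10^9 bytes)

print(f"{raw_gb:.0f} GB")
```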

Sampling
What: Topology? Nodes?
How: Directly? Exploration? E.g., Random Walk (RW)

A Random Walk in Facebook
Random Walk (RW) [1]:
Real average node degree: 94
Observed average node degree: 338
[Plot: sampled vs. real node degree distribution.]
Apply the Hansen-Hurwitz estimator, which corrects the sample using the degree of each sampled node s.
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM
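A minimal sketch of the degree correction, assuming the usual form of the re-weighting for a simple random walk (each sampled node is weighted by 1/degree, because the walk visits a node with probability proportional to its degree). The function names and the toy sample are hypothetical and are not taken from the slides:

```python
def naive_mean(sample, value):
    """Plain average over the sample; biased toward high-degree nodes under a RW."""
    return sum(value(s) for s in sample) / len(sample)


def reweighted_mean(sample, value, degree):
    """Hansen-Hurwitz style ratio estimator: weight each sampled node by 1/degree."""
    num = sum(value(s) / degree(s) for s in sample)
    den = sum(1.0 / degree(s) for s in sample)
    return num / den


# Toy path graph a - c - b: degrees 1, 1 and 2, so the true mean degree is 4/3.
toy_degree = {"a": 1, "b": 1, "c": 2}
# An idealized RW sample in which each node appears in proportion to its degree.
toy_sample = ["a", "b", "c", "c"]

biased = naive_mean(toy_sample, toy_degree.get)
corrected = reweighted_mean(toy_sample, toy_degree.get, toy_degree.get)
print(biased, corrected)
```

On this toy sample the naive average overestimates the mean degree, while the re-weighted one recovers it, which mirrors the 338 vs. 94 gap reported for Facebook.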

Related Work
RW in online graph sampling:
WWW [Henzinger et al. 2000, Baykan et al. 2009]
P2P [Gkantsidis et al. 2004, Stutzbach et al. 2006, Rasti et al. 2009]
OSN [Rasti et al. 2008, Krishnamurthy et al. 2008, Gjoka et al. 2010]
RW mixing improvements:
Random jumps [Henzinger et al. 2000, Avrachenkov et al. 2010]
Fastest Mixing Markov Chain [Boyd et al. 2004]
Multiple dependent walks [Ribeiro et al. 2010]
Multigraph Sampling [Gjoka et al. 2011]

What if the nodes are not equally important in our measurement?

Not all nodes are equal
Node categories: irrelevant, important, (equally) important
Stratification under Weighted Independence Sampler (WIS) (node size is proportional to its sampling probability)
But graph exploration techniques have to follow the links!
Enforcing WIS weights may lead to slow (or no) convergence.
Trade-off between ideal (WIS) sampling weights and fast convergence.
Assumption: On sampling a node, we learn categories of its neighbors.
Fastest Mixing Markov Chain [Boyd et al. 2004]

Initialization: Pilot Random Walk

Pilot Random Walk (RW)
Use classic Random Walk (RW).
Collect a list of existing relevant and irrelevant categories.
Estimate the relative volume of each category C_i with a RW-based estimator, built from the number of neighbors of each sampled node u that fall in C_i and from the size of the sample S.
Efficient! No need to visit C_i at all!
Estimation errors do not bias the ultimate measurement result (but they may increase its variance).
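The estimator's formula did not survive in the transcript. One form consistent with the two labels that did survive (the number of neighbors of u in C_i, and the size of the sample S) averages, over the sampled nodes, the fraction of each node's neighbors that fall in C_i. The sketch below assumes that form; the function name and the toy graph are hypothetical:

```python
from collections import defaultdict


def estimate_relative_volumes(sample, adj, category):
    """Estimate vol(C_i) / vol(V) for every category seen among neighbors.

    Assumed estimator: (1 / |S|) * sum over u in S of
    (# neighbors of u in C_i) / degree(u).
    It relies on the slide's assumption that sampling a node reveals
    the categories of its neighbors.
    """
    totals = defaultdict(float)
    for u in sample:
        k_u = len(adj[u])
        for v in adj[u]:
            totals[category[v]] += 1.0 / k_u
    n = len(sample)
    return {c: t / n for c, t in totals.items()}


# Toy path graph a - b - c; a is red, b and c are green.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
category = {"a": "red", "b": "green", "c": "green"}
# Idealized pilot RW sample: each node appears in proportion to its degree.
pilot_sample = ["a", "b", "b", "c"]

volumes = estimate_relative_volumes(pilot_sample, adj, category)
print(volumes)
```

In the toy graph the true volumes (sums of degrees) are 1 for red and 3 for green, so the relative volumes are 0.25 and 0.75. The estimate uses only neighbor information, which is why a category can be measured without the walk ever visiting it.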

Stratified Weighted Random Walk

Measurement objective
E.g., compare the size of red and green categories.

Measurement objective → Category weights optimal under WIS
Obtained from stratified sampling theory + information collected by the pilot RW.

Category weights optimal under WIS → Modified category weights
Problem 1: Poor or no connectivity.
Solution: Small weight > 0 for irrelevant categories. f* is the fraction of time we plan to spend in irrelevant nodes (e.g., 1%).
Problem 2: “Black holes”.
Solution: Limit the weight of tiny relevant categories. Γ is the maximal factor by which we can increase edge weights (e.g., 100 times).
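The two fixes can be illustrated with a short sketch. This is one possible reading of the slide, not the procedure from the paper: irrelevant categories share a total weight f* in proportion to their volume, and the amplification (weight divided by relative volume) of any relevant category is capped at Γ times that of the largest relevant category. The function name, argument layout and toy numbers are all assumptions:

```python
def modify_category_weights(wis_weights, rel_volume, f_star=0.01, gamma=100.0):
    """Return modified, normalized category weights (illustrative sketch only).

    wis_weights: WIS-optimal weights of the relevant categories (sum to 1).
    rel_volume:  relative volume of every category, e.g. from a pilot RW.
    Categories present in rel_volume but not in wis_weights are irrelevant.
    """
    relevant = set(wis_weights)
    irrelevant = set(rel_volume) - relevant

    # Problem 1: give irrelevant categories a small total weight f* > 0.
    scale = 1.0 - f_star if irrelevant else 1.0
    weights = {c: scale * w for c, w in wis_weights.items()}
    irr_vol = sum(rel_volume[c] for c in irrelevant)
    for c in irrelevant:
        weights[c] = f_star * rel_volume[c] / irr_vol

    # Problem 2: cap the amplification of tiny relevant categories at gamma
    # times the amplification of the largest relevant category.
    largest = max(relevant, key=lambda c: rel_volume[c])
    cap = gamma * weights[largest] / rel_volume[largest]
    for c in relevant:
        weights[c] = min(weights[c], cap * rel_volume[c])

    # Renormalize; after capping, the irrelevant share is no longer exactly f*.
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


wis = {"red": 0.5, "green": 0.5}                          # equal importance
vol = {"red": 0.001, "green": 0.4, "other": 0.599}        # "other" is irrelevant
modified = modify_category_weights(wis, vol, f_star=0.01, gamma=100.0)
print(modified)
```

Here the tiny red category would need an amplification about 400 times that of green to reach its WIS weight; the cap limits it to 100 times, trading some stratification for a walk that is less likely to get stuck.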

Modified category weights → Edge weights in G
Target edge weights: [worked example with the values 20, 22 and 4; only the annotation “20 = vol(green), from pilot RW” is recoverable from the transcript.]

Modified category weights → Edge weights in G
Target edge weights: resolve conflicts with arithmetic mean, geometric mean, max, …
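A small sketch of the conflict-resolution step, under the assumption that each category has a single target edge weight and that a conflict arises on an edge whose two endpoints belong to categories with different targets. The names `COMBINERS`, `edge_weight` and the toy targets are hypothetical:

```python
import math

# Candidate rules for combining the two endpoint targets of an edge.
COMBINERS = {
    "arithmetic": lambda a, b: (a + b) / 2.0,
    "geometric": lambda a, b: math.sqrt(a * b),
    "max": max,
}


def edge_weight(u, v, category, target, how="geometric"):
    """Weight of edge {u, v}, combining the targets of the endpoint categories."""
    return COMBINERS[how](target[category[u]], target[category[v]])


category = {"a": "red", "b": "green", "c": "red"}
target = {"red": 4.0, "green": 1.0}   # assumed per-category target edge weights

for how in COMBINERS:
    print(how, edge_weight("a", "b", category, target, how))
```

All three rules agree on edges inside a single category and differ only on edges that cross categories, so the choice mainly affects how easily the walk moves between strata.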

Edge weights in G → WRW sample

WRW sample → Final result
Obtained with the Hansen-Hurwitz estimator.
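The last two steps can be sketched as follows, assuming symmetric edge weights: the weighted random walk (WRW) moves to a neighbor with probability proportional to the edge weight, so it visits a node in proportion to the sum of its edge weights, and the Hansen-Hurwitz correction divides by that sum. Function names and the toy graph are hypothetical:

```python
import random


def node_strength(u, adj, w):
    """Sum of the weights of the edges incident to u."""
    return sum(w(u, v) for v in adj[u])


def weighted_random_walk(adj, w, start, n_steps, rng):
    """Walk n_steps, choosing each next node with probability ~ edge weight."""
    u = start
    sample = []
    for _ in range(n_steps):
        nbrs = adj[u]
        u = rng.choices(nbrs, weights=[w(u, v) for v in nbrs], k=1)[0]
        sample.append(u)
    return sample


def estimate_relative_size(sample, adj, w, in_category):
    """Hansen-Hurwitz style estimate of the fraction of nodes in a category."""
    inv = [1.0 / node_strength(s, adj, w) for s in sample]
    hit = sum(x for s, x in zip(sample, inv) if in_category(s))
    return hit / sum(inv)


# Toy path graph a - b - c with a heavy edge toward the "red" node a.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
weights = {frozenset(("a", "b")): 3.0, frozenset(("b", "c")): 1.0}
w = lambda u, v: weights[frozenset((u, v))]

# Idealized WRW sample: nodes appear in proportion to their strengths 3 : 4 : 1.
ideal_sample = ["a"] * 3 + ["b"] * 4 + ["c"]
red_share = estimate_relative_size(ideal_sample, adj, w, lambda s: s == "a")

walk = weighted_random_walk(adj, w, "a", 50, random.Random(0))
print(red_share, len(walk))
```

Although node a is oversampled on purpose, the corrected estimate of its share of the nodes is still one third, which is the point of combining stratified weights with the estimator.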

Stratified Weighted Random Walk (S-WRW)
Measurement objective → Category weights optimal under WIS → Modified category weights → Edge weights in G → WRW sample → Final result
E.g., compare the size of red and green categories.

Simulation results

[Plot: NRMSE(size(red)) vs. weight w, with curves for S-WRW, RW and WIS; marked points: Uniform, Optimal under WIS.]
Trade-off between fast mixing (~RW) and the weights optimal under the Weighted Independence Sampler (WIS).

[Plot: NRMSE(size(red)) vs. weight w; marked point: Optimal under WIS.]
The larger the sample size n, the closer to WIS.

Evaluation on Facebook

Colleges in Facebook
[Plot: versions of S-WRW vs. Random Walk (RW).]
Samples in colleges: 86% of S-WRW, 9% of RW. This is because S-WRW avoids irrelevant categories.
The difference is larger (100x) for small colleges. This is due to S-WRW’s stratification.
RW discovered 5,325 colleges. S-WRW: 8,815 (not shown).

College size estimation
[Plot: versions of S-WRW vs. Random Walk (RW).]
RW needs about 14 times more samples to achieve the same error!
14 ≈ 9 (irrelevant categories) x 1.5 (stratification)

Thank you!
Walking on a Graph with a Magnifying Glass
Maciej Kurant, Minas Gjoka, Carter T. Butts and Athina Markopoulou, UC Irvine
Facebook datasets available from:
Example application:

Parameters
f*: the fraction of time we plan to spend in irrelevant nodes.
f* = 0 iff all nodes are relevant, f* > 0 otherwise.
f* << 1.
Exploit the pilot RW information. E.g., set f* higher when relevant categories are poorly interconnected.
In Facebook, we used f* = 1%.
Γ >= 1: the maximal resolution of our “graph magnifying glass”.
Let B be the size of the largest relevant category. S-WRW will typically sample well all categories whose size is at least B / Γ.
Think of the smallest category that is still relevant: this gives Γ.
Set Γ smaller for smaller sample sizes.
Set Γ smaller in graphs with tight community structure.
In Facebook, we set Γ = 1000.
In the paper, we show that S-WRW is quite robust to the choice of these parameters.

Toy graphs