Modeling, Sampling, and Generating Networks with MRV


Modeling, Sampling, and Generating Networks with MRV
UMass Team

Agenda
- Sampling and estimation of network degree distributions – Don
- Effect of MRV on network characteristics – Bo
- Generative models for MRV – Shan

Characterizing the Joint Degree Distribution of Directed Networks
Fabricio Murai, Don Towsley (UMass Amherst)

Outline
- RW-based estimation: hidden incoming edges, visible incoming edges
- Sensitivity of accuracy to heavy tails, reciprocity, and correlation

Directed networks: hidden edges
- Outgoing edges are visible; incoming edges are hidden
- Random walk (RW) based estimation (INFOCOM'12), sketched below:
  - during the walk, construct an undirected graph consistent with the walk
  - follow an outgoing edge on the first visit to a node; walk the new undirected graph on revisits
  - use the known solution for RWs on undirected graphs to characterize the network
- Can estimate the outdegree distribution
- Impossibility result for the joint in/outdegree distribution: too little statistical information without sampling most of the graph
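A rough Python sketch of the incremental construction just described, under several assumptions: `out_adj` maps each node to its set of out-neighbors, the name `hidden_edge_walk` is invented, and the random-jump component of the actual INFOCOM'12 algorithm is omitted.

```python
import random

def hidden_edge_walk(out_adj, start, n_steps, seed=0):
    rng = random.Random(seed)
    # Undirected graph built on the fly from edges revealed so far.
    und = {start: set(out_adj[start])}
    for u in out_adj[start]:
        und.setdefault(u, set()).add(start)
    visited = {start}
    samples, v = [], start
    for _ in range(n_steps):
        v = rng.choice(sorted(und[v]))        # one step on the constructed graph
        if v not in visited:                  # first visit: reveal v's out-edges
            visited.add(v)
            for u in out_adj[v]:
                und[v].add(u)
                und.setdefault(u, set()).add(v)
        samples.append((v, len(out_adj[v])))  # outdegree of v is observable
    return samples
```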

RW-based: visible edges
- Transform the digraph into an undirected graph
- Collect samples using a RW: $s_1, s_2, \ldots, s_n$, with $s_k = (i_k, o_k)$
- Estimate
  $$\hat{\varphi}_{i,j} = \frac{1}{n} \sum_{k} \frac{h_{ij}(s_k)}{\pi(s_k)}, \qquad i,j = 0,1,\ldots$$
  where
  $$h_{ij}(s_k) = \begin{cases} 1, & i_k = i,\ o_k = j \\ 0, & \text{otherwise} \end{cases}$$
  and $\pi(s_k) = C \times \deg(s_k)$, with $\deg(s_k)$ the degree of $s_k$ in the new undirected graph and $C$ chosen to make $\pi$ a distribution (see the sketch below)
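A minimal, self-contained sketch of this estimator in Python, in its self-normalized form (so the unknown constant $C$ cancels); the adjacency-set representation and the names `out_adj`, `in_adj`, and `estimate_joint_degree` are assumptions for illustration.

```python
import random
from collections import defaultdict

def estimate_joint_degree(out_adj, in_adj, start, n_steps, seed=0):
    rng = random.Random(seed)
    # Undirected view: a node's neighbors are its in- plus out-neighbors
    # (assumes every node appears as a key of both maps).
    und = {v: out_adj[v] | in_adj[v] for v in out_adj}
    weights = defaultdict(float)   # accumulated 1/pi mass per (indeg, outdeg)
    v = start
    for _ in range(n_steps):
        v = rng.choice(sorted(und[v]))            # one RW step
        i, o = len(in_adj[v]), len(out_adj[v])    # sample s_k = (i_k, o_k)
        weights[(i, o)] += 1.0 / len(und[v])      # 1/pi(s_k), up to C
    total = sum(weights.values())                 # self-normalization
    return {k: w / total for k, w in weights.items()}

# Toy digraph: edges 0->1, 0->2, 1->2, 2->0.
out_adj = {0: {1, 2}, 1: {2}, 2: {0}}
in_adj = {0: {2}, 1: {0}, 2: {0, 1}}
print(estimate_joint_degree(out_adj, in_adj, start=0, n_steps=10_000))
```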

RW-based degree distribution estimation
- Performance on real datasets
- vs. uniform vertex sampling
- vs. DURW for the marginal outdegree distribution
- Effect of heavy tails?
- Effect of reciprocity, correlation?

Real datasets: in/out degrees appear to be heavy-tailed
(Note: the reported numbers are approximate, since they are computed on the giant connected component.)

Behavior of RW on real datasets: YouTube
Sampling budget: 10% of the graph size (B = 0.1)
[Figure: heatmaps of log(PMF) and log(NRMSE) over indegree x outdegree]
Simulation results. NRMSE behaves as expected (see the note below); two forces are at work:
- high probability mass causes upsampling, so errors are small where the PMF is large, even at low total degree
- high total degree causes upsampling, so errors decrease with total degree
The middle-left region (small PMF and small total degree) therefore shows the largest errors.
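For reference, a plausible reading of NRMSE here (the slides never spell out the definition) is the per-cell normalized root mean squared error of the estimated joint PMF:

$$\mathrm{NRMSE}(\hat{\theta}_{ij}) = \frac{\sqrt{\mathbb{E}\big[(\hat{\theta}_{ij} - \theta_{ij})^2\big]}}{\theta_{ij}}$$

Under this reading, both a large true mass $\theta_{ij}$ and heavy sampling of the corresponding nodes drive the ratio down, matching the two forces described above.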

RW vs. vertex sampling: YouTube, log(NRMSE) with B = 0.1
[Figure: heatmaps of log(NRMSE) over indegree x outdegree for vertex sampling]
Simulation results with B = 0.1, assuming a jump cost of 10, so vertex sampling gets only 1/10 as many samples. Vertex sampling performs as well as RW only for very low total degree, and worse everywhere else.

Web-Google, B = 0.1
[Figure: heatmaps of log(PMF) and log(NRMSE) over indegree x outdegree]
Here the indegree spans one more order of magnitude than the outdegree (web pages cannot have too many out-links).

Wiki-Talk, B = 0.1
[Figure: heatmaps of log(PMF) and log(NRMSE) over indegree x outdegree]
Here the outdegree spans one more order of magnitude than the indegree.

Outdegree distribution estimation: DURW vs. RW
YouTube: correlation 0.95, reciprocity 0.79 (high reciprocity, high correlation)
[Figure: NRMSE vs. outdegree for both methods]
- RW based on all edges provides (slightly) lower errors; is DURW misusing the indegree information?
- High correlation -> both methods exhibit the same trends
- High reciprocity -> DURW "sees" the graph much as the RW does

Other datasets
[Figure: NRMSE vs. outdegree for Wiki-Talk (correlation 0.47, reciprocity 0.14) and Web-Google (correlation 0.13, reciprocity 0.31)]
- Wiki-Talk: medium correlation -> the two methods' trends differ in the middle of the range; low reciprocity -> NRMSE is one order of magnitude smaller for RW
- Web-Google: low correlation -> nodes with large outdegree have small indegree and end up downsampled by the RW, hence larger errors; medium reciprocity -> the NRMSEs have about the same order of magnitude

Sensitivity of the RW method to graph structure: reciprocity and correlation
Methodology:
- approximate RW sampling by independent edge sampling
- focus on joint distributions with Pareto(3) marginals

Approximating RW by edge sampling
[Figure: joint in/outdegree samples under RW sampling vs. edge sampling]
- Target: $f_{i,j,r} = P(\text{in} = i,\ \text{out} = j,\ \text{recip} = r)$
- Edge sampling: select a random edge, then a random endpoint
- Sampling a node with indegree $i$, outdegree $j$, and reciprocal degree $r$ with this probability yields an unbiased estimate
- Experiments suggest the approximation is sufficiently accurate (sketch below)
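A small Python sketch of the edge-sampling surrogate, under the standard observation that choosing a uniform edge of the undirected view and then a uniform endpoint selects a node with probability proportional to its degree, mimicking the RW's stationary distribution; the names `edge_sampling_estimate` and `und_edges` are illustrative.

```python
import random
from collections import defaultdict

def edge_sampling_estimate(und_edges, in_adj, out_adj, n_samples, seed=0):
    rng = random.Random(seed)
    deg = defaultdict(int)                 # degrees in the undirected view
    for u, v in und_edges:
        deg[u] += 1
        deg[v] += 1
    weights = defaultdict(float)
    for _ in range(n_samples):
        u, v = rng.choice(und_edges)       # uniform random edge...
        w = rng.choice((u, v))             # ...then a uniform endpoint:
        i, o = len(in_adj[w]), len(out_adj[w])   # node drawn prop. to degree
        weights[(i, o)] += 1.0 / deg[w]    # reweight to undo the degree bias
    total = sum(weights.values())
    return {k: x / total for k, x in weights.items()}
```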

Controlling reciprocity
Goal: given a degree distribution $\theta_{ij}$, generate $\theta_{ijr} = P(\text{in} = i,\ \text{out} = j,\ \text{recip} = r)$ with an arbitrary amount of reciprocity
Model (sketched below):
- given indegree $i$ and outdegree $j$, the number of reciprocated edges is $r \sim \mathrm{Binom}(\min(i,j),\ p)$
- $p$ is the tuning parameter: minimum reciprocity at $p = 0$, maximum at $p = 1$
- the resulting distribution may not be graphical
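A direct NumPy/SciPy transcription of this binomial model, assuming `theta` is a square array with `theta[i, j]` $= \theta_{ij}$; the function name `add_reciprocity` is invented.

```python
import numpy as np
from scipy.stats import binom

def add_reciprocity(theta, p):
    """Spread theta[i, j] over r ~ Binom(min(i, j), p)."""
    n = theta.shape[0]
    theta_ijr = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            m = min(i, j)
            # binom.pmf(r, m, p) sums to 1 over r = 0..m, so mass is preserved.
            theta_ijr[i, j, : m + 1] = theta[i, j] * binom.pmf(np.arange(m + 1), m, p)
    return theta_ijr   # may not be graphical, as the slide notes
```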

Controlling correlation
Goal: given a marginal distribution $g_i$, $i = 0,1,\ldots$, generate $\theta_{ij}$ s.t. $\theta_{i*} = \theta_{*j} = g_i$ with an arbitrary nonnegative Pearson correlation
Model: a mixture of (see the sketch below)
- perfect correlation: $f_1(i,j) = g_i$ if $i = j$, and $0$ otherwise
- no correlation: $f_2(i,j) = g_i \times g_j$
$$\theta_{ij} = \alpha f_1(i,j) + (1 - \alpha) f_2(i,j)$$
with $\alpha$ the tuning parameter
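The mixture is one line in NumPy; `mix_correlation` and the toy marginal are illustrative. The assertion checks the marginal-preservation property stated in the goal.

```python
import numpy as np

def mix_correlation(g, alpha):
    f1 = np.diag(g)        # perfect correlation: all mass on the diagonal
    f2 = np.outer(g, g)    # independence: product of the marginals
    return alpha * f1 + (1.0 - alpha) * f2

g = np.array([0.5, 0.3, 0.2])            # toy marginal distribution
theta = mix_correlation(g, alpha=0.7)
# Both marginals equal g for any alpha in [0, 1].
assert np.allclose(theta.sum(axis=1), g) and np.allclose(theta.sum(axis=0), g)
```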

Pareto: effect of reciprocity Correlation, 𝛼=0.99 Pareto: effect of reciprocity indegree outdegree log(NRMSE(recip(0)/NRMSE(recip(1))) Q: effect of reciprocity on RW estimation efficiency 𝜃 𝑖𝑗 𝑝 - in/outdegree distribution under 𝑝-reciprocity model focus on 𝑁𝑅𝑀𝑆𝐸( 𝜃 𝑖𝑗 0 ) 𝑁𝑅𝑀𝑆𝐸( 𝜃 𝑖𝑗 1 ) reciprocity hurts estimation when degrees correlated diagonal estimation helped when no correlation recip(0) better No correlation, 𝛼=0 indegree outdegree log(NRMSE(recip(0)/NRMSE(recip(1))) When reciprocity is maximum, the proportion of samples along the diagonal does not change. Hence, log-ratio is zero. recip(0) better recip(1) better

Pareto: effect of correlation
Q: how does correlation affect the efficiency of RW-based estimation?
- $\theta_{ij}^\alpha$: in/outdegree distribution under the $\alpha$-correlation model
- Focus on the ratio $\mathrm{NRMSE}(\hat{\theta}_{ij}^0)\,/\,\mathrm{NRMSE}(\hat{\theta}_{ij}^{0.99})$
[Figure: heatmap of log(NRMSE(corr(0))/NRMSE(corr(0.99))) over indegree x outdegree, with the diagonal indegree = outdegree marked]
- Increased correlation -> increased accuracy on the diagonal

Roadmap
- Rewiring algorithms to generate arbitrary reciprocity and correlation in real networks
- A more flexible correlation model (input covariance matrix, possibly different marginals)
- Evaluation on networks; other network structural properties
- Capturing the MRV characteristic in all of the above