A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi.

Slides:

Advertisements

Similar presentations

Oracle Labs Graph Analytics Research Hassan Chafi Sr. Research Manager Oracle Labs Graph-TA 2/21/2014.

Advertisements

Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.

Power Laws By Cameron Megaw 3/11/2013. What is a Power Law?

Analysis and Modeling of Social Networks Foudalis Ilias.

Web as Network: A Case Study Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.

1 2.5K-Graphs: from Sampling to Generation Minas Gjoka, Maciej Kurant ‡, Athina Markopoulou UC Irvine, ETZH ‡

Practical Recommendations on Crawling Online Social Networks

Xiaowei Ying Xintao Wu Univ. of North Carolina at Charlotte 2009 SIAM Conference on Data Mining, May 1, Sparks, Nevada Graph Generation with Prescribed.

1 A Random-Surfer Web-Graph Model (Joint work with Avrim Blum & Hubert Chan) Mugizi Rwebangira.

Directional triadic closure and edge deletion mechanism induce asymmetry in directed edge properties.

Networks. Graphs (undirected, unweighted) has a set of vertices V has a set of undirected, unweighted edges E graph G = (V, E), where.

Mining and Searching Massive Graphs (Networks)

Leting Wu Xiaowei Ying, Xintao Wu Dept. Software and Information Systems Univ. of N.C. – Charlotte Reconstruction from Randomized Graph via Low Rank Approximation.

CMU SCS KDD 2006Leskovec & Faloutsos1 ??. CMU SCS KDD 2006Leskovec & Faloutsos2 Sampling from Large Graphs poster# 305 Jurij (Jure) Leskovec Christos.

Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.

A New Force-Directed Graph Drawing Method Based on Edge- Edge Repulsion Chun-Cheng Lin and Hsu-Chen Yen Department of Electrical Engineering, National.

Maciej Kurant (EPFL / UCI) Joint work with: Athina Markopoulou (UCI),

Applied Geostatistics Geostatistical techniques are designed to evaluate the spatial structure of a variable, or the relationship between a value measured.

Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.

1 Data Structures and Algorithms Graphs I: Representation and Search Gal A. Kaminka Computer Science Department.

Geographic Gossip: Efficient Aggregations for Sensor Networks Author: Alex Dimakis, Anand Sarwate, Martin Wainwright University: UC Berkeley Venue: IPSN.

The Impact of Spatial Correlation on Routing with Compression in WSN Sundeep Pattem, Bhaskar Krishnamachri, Ramesh Govindan University of Southern California.

Advanced Topics in Data Mining Special focus: Social Networks.

Graphs and Topology Yao Zhao. Background of Graph A graph is a pair G =(V,E) –Undirected graph and directed graph –Weighted graph and unweighted graph.

Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,

On Distinguishing between Internet Power Law B Bu and Towsley Infocom 2002 Presented by.

CoNA : Dynamic Application Mapping for Congestion Reduction in Many-Core Systems 2012 IEEE 30th International Conference on Computer Design (ICCD) M. Fattah,

Models of Influence in Online Social Networks

Exploiting indirect neighbors and topological weight to predict protein function from protein– protein interactions Hon Nian Chua, Wing-Kin Sung and Limsoon.

Topic 13 Network Models Credits: C. Faloutsos and J. Leskovec Tutorial

Traceroute-like exploration of unknown networks: a statistical analysis A. Barrat, LPT, Université Paris-Sud, France I. Alvarez-Hamelin (LPT, France) L.

Efficient Gathering of Correlated Data in Sensor Networks

1 On Querying Historical Evolving Graph Sequences Chenghui Ren $, Eric Lo *, Ben Kao $, Xinjie Zhu $, Reynold Cheng $ $ The University of Hong Kong $ {chren,

Data Analysis in YouTube. Introduction Social network + a video sharing media – Potential environment to propagate an influence. Friendship network and.

Rate-based Data Propagation in Sensor Networks Gurdip Singh and Sandeep Pujar Computing and Information Sciences Sanjoy Das Electrical and Computer Engineering.

WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS junction.

Network Characterization via Random Walks B. Ribeiro, D. Towsley UMass-Amherst.

Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.

COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.

A Survey of Distributed Task Schedulers Kei Takahashi (M1)

A Graph-based Friend Recommendation System Using Genetic Algorithm

 Building Networks. First Decisions  What do the nodes represent?  What do the edges represent?  Know this before doing anything with data!

Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.

A genetic approach to the automatic clustering problem Author : Lin Yu Tseng Shiueng Bien Yang Graduate : Chien-Ming Hsiao.

BotGraph: Large Scale Spamming Botnet Detection Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum Speaker: 林佳宜.

Learning the Structure of Related Tasks Presented by Lihan He Machine Learning Reading Group Duke University 02/03/2006 A. Niculescu-Mizil, R. Caruana.

Click to Edit Talk Title USMA Network Science Center Specific Communication Network Measure Distribution Estimation Daniel.

J. Hwang, T. He, Y. Kim Presented by Shan Gao. Introduction  Target the scenarios where attackers announce phantom nodes.  Phantom node  Fake their.

University “Ss. Cyril and Methodus” SKOPJE Cluster-based MDS Algorithm for Nodes Localization in Wireless Sensor Networks Ass. Biljana Stojkoska.

KAIS T On the problem of placing Mobility Anchor Points in Wireless Mesh Networks Lei Wu & Bjorn Lanfeldt, Wireless Mesh Community Networks Workshop, 2006.

341- INTRODUCTION TO BIOINFORMATICS Overview of the Course Material 1.

Clusters Recognition from Large Small World Graph Igor Kanovsky, Lilach Prego Emek Yezreel College, Israel University of Haifa, Israel.

Comparison of Tarry’s Algorithm and Awerbuch’s Algorithm CS 6/73201 Advanced Operating System Presentation by: Sanjitkumar Patel.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

A configuration method for structured P2P overlay network considering delay variations Tomoya KITANI (Shizuoka Univ. 、 Japan) Yoshitaka NAKAMURA (NAIST,

Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Author: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos.

Models of Web-Like Graphs: Integrated Approach

Web Page Clustering using Heuristic Search in the Web Graph IJCAI 07.

Energy Models for Graph Clustering Bo-Young Kim Applied Algorithm Lab, KAIST.

1 Link Privacy in Social Networks Aleksandra Korolova, Rajeev Motwani, Shubha U. Nabar CIKM’08 Advisor: Dr. Koh, JiaLing Speaker: Li, HueiJyun Date: 2009/3/30.

1 Coarse-Grained Topology Estimation via Graph Sampling Maciej Kurant 1 Minas Gjoka 2 Yan Wang 2 Zack W. Almquist 2 Carter T. Butts 2 Athina Markopoulou.

GIS and Spatial Analysis1 Summary  Parametric Test  Interval/ratio data  Based on normal distribution  Difference in Variances  Differences are just.

Uncovering the Mystery of Trust in An Online Social Network

Discrete ABC Based on Similarity for GCP

Topology Control –power control

Distributed voting application for handheld devices

Empirical analysis of Chinese airport network as a complex weighted network Methodology Section Presented by Di Li.

Gephi Gephi is a tool for exploring and understanding graphs. Like Photoshop (but for graphs), the user interacts with the representation, manipulate the.

Department of Computer Science University of York

Locality In Distributed Graph Algorithms

Presentation transcript:

A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi State University 2 Pacific Northwest National Laboratory

Outline  Introduction  Motivation  Problems  Evaluation  KS-Distance  Graph Type and Properties  Sampling Methods  Node sampling  Edge sampling  Topology Based Sampling  Comparison of Sampling Results  Statistical Comparison  Visual Comparison  Efficiency Comparison  Sampling on Large Graph  Node Sampling  Out Degree Distribution  Conclusion

Introduction  Motivation  Visualization  To display all nodes and edges is impossible.  Estimation or calculation  To calculate graph properties on a large graph is costly.  Data  No complete data.  To obtain data is time-consuming.

Introduction  Problems  Given a huge graph, how to get a representative sample?  Given several sampling methods, which sampling method is best?  How to compare those sampling results?  How do we measure success?

Outline  Introduction  Motivation  Problems  Evaluation  KS-Distance  Graph Type and Properties  Sampling Methods  Node sampling  Edge sampling  Topology Based Sampling  Comparison of Sampling Results  Statistical Comparison  Visual Comparison  Efficiency Comparison  Sampling on Large Graph  Node Sampling  Out Degree Distribution  Conclusion

Evaluation  Kolmogorov-Smirnov (KS) D-statistic  D n = sup x |F n (x)-F(x)| sup x : supremum of the set of distance. F n and F are distribution function.  To evaluate the similarity between original and sampled graph  To calculate KS value on graph properties.

Evaluation Graph properties(10): Directed Graph  In-degree distribution (InDD)  Out-degree distribution (OutDD)  Betweenness centrality distribution (BCB)  Average neighbor degree distribution (ANDD)  In-degree centrality distribution (InDCD)  Out-degree centrality distribution (OutDCD)  Edge betweenness centrality distribution (EBCD)  Weakly connected component distribution(WCCD)  Hops distribution (HD)  Hops distribution in largest weakly connected component (HLCCD Graph properties(7): Undirected Graph  Degree distribution (DD)  Betweenness centrality distribution (BB)  Clustering coefficient (CCD)  Average neighbor degree distribution (ANDD)  Degree centrality distribution (DCD)  Edge betweenness centrality distribution (EBCD)  Hop distribution (HD)

Outline  Introduction  Motivation  Problems  Evaluation  KS-Distance  Graph Type and Properties  Sampling Methods  Node sampling  Edge sampling  Topology Based Sampling  Comparison of Sampling Results  Statistical Comparison  Visual Comparison  Efficiency Comparison  Sampling on Large Graph  Node Sampling  Out Degree Distribution  Conclusion

Sampling algorithms  Node Sampling  Random node (RN)  Random node-edge (RNE)  Random node-neighbour (RNN)  Streaming nodes (SN)  Edge Sampling  Random edge (RE)  Induced edge (IE)  Streaming edge (SE)  Topology Based Sampling  Breadth-first (BF)  Depth-first (DF)  Random first (RF)  Snowball (SB)  Random walk (RW)  Random walk with escape (RWE)  Forest fire (FF)

Outline  Introduction  Motivation  Problems  Evaluation  KS-Distance  Graph Type and Properties  Sampling Methods  Node sampling  Edge sampling  Topology Based Sampling  Comparison of Sampling Results  Statistical Comparison  Visual Comparison  Efficiency Comparison  Sampling on Large Graph  Node Sampling  Out Degree Distribution  Conclusion

Comparison of Sampling Results Undirected Graph  American Airlines connection data  235 nodes 1297 edges  Simulated random undirected Graph  Graph 1: 500 nodes 1260 edges  Graph 2: 500 nodes 1238 edges  Graph 3: 750 nodes 2871 edges  Graph 4: 1000 nodes 4989 edges  Graph 5: 1000 nodes 5092 edges  Graph 6: 1250 nodes 7884 edges Directed Graph  VAST  1214 nodes edges  Simulated random directed Graph  Graph 1: 500 nodes 1284 edges  Graph 2: 500 nodes 1262 edges  Graph 3: 750 nodes 2859 edges  Graph 4: 1000 nodes 4900 edges  Graph 5: 1000 nodes 4954 edges  Graph 6: 1250 nodes 7732 edges Simulated Graph: Random Graph created by Erdős–Rényi model

Statistical Comparison of Sampling Results  Comparison Categories (4) ：  Between directed and undirected Graph  American airline connection data and VAST data  Between different graph type of undirected or directed graphs  American airline connection data and Simulated undirected graph  Between multiple graphs of same type but different sizes.  Simulated data with different number of nodes(500, 750, 1000, 1250)  Between multiple graphs of same size and same type  Simulated Graph(500 VS 500, 1000 VS 1000)

Category 1: Between directed and undirected Graph (Undirected Graph)  Data: American Airlines connection data. Sampling rate: average(10%-50%)

Category 1: Between directed and undirected Graph (directed Graph)  Data: VAST 2013 Netflow data. Sampling rate: average(10%-50%)

Category 2: between two undirected graphs with different graph type  Data: American Airline connection data VS Simulated 500_1. Sampling rate: average(10%-50%)

Category 3: between multiple graphs of same type but different sizes  Data: Simulated data 500_1 VS Sampling rate: average(10%-50%)

Category 4: between multiple graphs of same size and same type  Data: Simulated data 500_1 VS 500_2. Sampling rate: average(10%-50%)

Category 4: between multiple graphs of same size and same type  Data: Simulated data 1000_1 VS 1000_2. Sampling rate: average(10%-50%)

Visual Comparison of Sampling Results  Visual Comparison  Data: American Airlines connection data  Graph Type: undirected graph  Sampling rate: 10% on edges  Visual Comparison Technique  Fix nodes location, label size, color etc.  Analysis Edge-related sampling methods are biased towards high-degree nodes. For example, random edge sampling, induced edge sampling, streaming edge sampling are easy to sample high degree nodes (136,50, 80,130,70 etc.)

Efficiency Comparison  Execution Time  Data  simulated undirected graph data  Sampling rate  Average of sampling result with sampling rate range from 10% to 50% on edges.

Outline  Introduction  Motivation  Problems  Evaluation  KS-Distance  Graph Type and Properties  Sampling Methods  Node sampling  Edge sampling  Topology Based Sampling  Comparison of Sampling Results  Statistical Comparison  Visual Comparison  Efficiency Comparison  Sampling on Large Graph  Node Sampling  Out Degree Distribution  Conclusion

Sampling on Big Graph  Graph Generation  Erdős–Rényi model  parallel algorithm (by MPi4py)  Shadow ІІ. Used 100 nodes, 2000 processors, each node: 512GB memory.  Size:  1 billion nodes  ~500 billion edges.  Graph Storage:  about 10TB

Sampling on Big Graph  Original Graph  Out Degree distribution

Sampling on Big Graph  Node sampling. Sampling Rate: 10 %  Out Degree Distribution:

Outline  Introduction  Motivation  Problems  Evaluation  KS-Distance  Graph Type and Properties  Sampling Methods  Node sampling  Edge sampling  Topology Based Sampling  Comparison of Sampling Results  Statistical Comparison  Visual Comparison  Efficiency Comparison  Sampling on Large Graph  Node Sampling  Out Degree Distribution  Conclusion

Conclusion  No sampling method works well for all graphs.  In visual comparison, the consistent graph layout facilitates comparison.  The benchmark helps users choose proper sampling methods in applications.  The benchmark provides an avenue to explore big graph.

Questions? Thanks! Acknowledgement T HIS WORK IS SUPPORTED BY THE P ACIFIC N ORTHWEST N ATIONAL L ABORATORY UNDER THE U.S. D EPARTMENT OF E NERGY C ONTRACT DE-AC05-76RL01830