Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2.

Slides:



Advertisements
Similar presentations
Great Theoretical Ideas in Computer Science
Advertisements

Partitional Algorithms to Detect Complex Clusters
Social network partition Presenter: Xiaofei Cao Partick Berg.
Overview of this week Debugging tips for ML algorithms
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Great Theoretical Ideas in Computer Science for Some.
1 Discrete Structures & Algorithms Graphs and Trees: III EECE 320.
For Wednesday Read chapter 19, sections 1-3 No homework.
Distributed Graph Analytics Imranul Hoque CS525 Spring 2013.
Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.
Jure Leskovec, CMU Kevin Lang, Anirban Dasgupta, Michael Mahoney Yahoo! Research.
Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
Decision Tree under MapReduce Week 14 Part II. Decision Tree.
Graphs (Part I) Shannon Quinn (with thanks to William Cohen of CMU and Jure Leskovec, Anand Rajaraman, and Jeff Ullman of Stanford University)
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Great Theoretical Ideas in Computer Science.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation James Foulds 1, Levi Boyles 1, Christopher DuBois 2 Padhraic Smyth.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
Improved results for a memory allocation problem Rob van Stee University of Karlsruhe Germany Leah Epstein University of Haifa Israel WADS 2007 WAOA 2007.
Image Segmentation Image segmentation is the operation of partitioning an image into a collection of connected sets of pixels. 1. into regions, which usually.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
Projects ( ) Ida Mele. Rules Students have to work in teams (max 2 people). The project has to be delivered by the deadline that will be published.
Exploratory Data Analysis on Graphs William Cohen.
Fixed Parameter Complexity Algorithms and Networks.
The Best Algorithms are Randomized Algorithms N. Harvey C&O Dept TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A AAAA.
Finding dense components in weighted graphs Paul Horn
MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.
GraphLab: how I understood it with sample code Aapo Kyrola, Carnegie Mellon Univ. Oct 1, 2009.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz (Slides by Tyler S. Randolph)
Graph Algorithms - continued William Cohen. Outline Last week: – PageRank – one sample algorithm on graphs edges and nodes in memory nodes in memory nothing.
DATA MINING LECTURE 13 Pagerank, Absorbing Random Walks Coverage Problems.
 Building Networks. First Decisions  What do the nodes represent?  What do the edges represent?  Know this before doing anything with data!
Scaling up Decision Trees. Decision tree learning.
Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.
Graph Algorithms: Properties of Graphs? William Cohen.
Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies Akshay Java Anupam Joshi Tim Finin University of Maryland, Baltimore County.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Overview of this week Debugging tips for ML algorithms Graph algorithms – A prototypical graph algorithm: PageRank In memory Putting more and more on disk.
Author : Yang Xu, Lei Ma, Zhaobo Liu, H. Jonathan Chao Publisher : ANCS 2011 Presenter : Jo-Ning Yu Date : 2011/12/28.
Graphs, Vectors, and Matrices Daniel A. Spielman Yale University AMS Josiah Willard Gibbs Lecture January 6, 2016.
Estimating PageRank on Graph Streams Atish Das Sarma (Georgia Tech) Sreenivas Gollapudi, Rina Panigrahy (Microsoft Research)
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
Overview of this week Debugging tips for ML algorithms Graph algorithms – A prototypical graph algorithm: PageRank In memory Putting more and more on disk.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
1 CPSC 320: Intermediate Algorithm Design and Analysis July 14, 2014.
Graph Algorithms - continued William Cohen. Outline Last week: – PageRank – one algorithm on graphs edges and nodes in memory nodes in memory nothing.
Network applications Sushmita Roy BMI/CS 576 Dec 9 th, 2014.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Cohesive Subgraph Computation over Large Graphs
The Best Algorithms are Randomized Algorithms
Minimum Spanning Tree 8/7/2018 4:26 AM
Great Theoretical Ideas in Computer Science
CSE 454 Advanced Internet Systems University of Washington
CS 3343: Analysis of Algorithms
Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses Nagakishore Jammula, Sriram P. Chockalingam,
Machine Learning Today: Reading: Maria Florina Balcan
Algorithm design and Analysis
Discrete Mathematics for Computer Science
Graph Algorithms Ch. 5 Lin and Dyer.
A Faster Algorithm for Computing the Principal Sequence of Partitions
Minwise Hashing and Efficient Search
President’s Day Lecture: Advanced Nearest Neighbor Search
Graph Algorithms Ch. 5 Lin and Dyer.
Presentation transcript:

Subsampling Graphs 1

RECAP OF PAGERANK-NIBBLE 2

Why I’m talking about graphs Lots of large data is graphs – Facebook, Twitter, citation data, and other social networks – The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks – Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks nodes = documents or words links connect document  word or word  document – Computer networks, biological networks (proteins, ecosystems, brains, …), … – Heterogeneous networks with multiple types of nodes people, groups, documents 3

Our first operation on graphs Local graph partitioning Why? – it’s about the best we can do in terms of subsampling/exploratory analysis of a graph 4

What is Local Graph Partitioning? GlobalLocal 5

What is Local Graph Partitioning? 6

Main results of the paper (Reid et al) 1.An approximate personalized PageRank computation that only touches nodes “near” the seed – but has small error relative to the true PageRank vector 2.A proof that a sweep over the approximate personalized PageRank vector finds a cut with conductance sqrt(α ln m) – unless no good cut exists no subset S contains significantly more mass in the approximate PageRank than in a uniform distribution 7

Approximate PageRank: Key Idea p is current approximation (start at 0) r is set of “recursive calls to make” residual error start with all mass on s u is the node picked for the next call 8

ANALYSIS OF PAGERANK-NIBBLE 9

Analysis linearity re-group & linearity 10

Approximate PageRank: Key Idea By definition PageRank is fixed point of: Claim: Proof: 11

Approximate PageRank: Algorithm 12

Analysis So, at every point in the apr algorithm: Also, at each point, |r| 1 decreases by α * ε *degree(u), so: after T push operations where degree(i-th u)=d i, we know which bounds the size of r and p 13

Analysis With the invariant: This bounds the error of p relative to the PageRank vector. 14

Comments – API p,r are hash tables – they are small (1/εα) push just needs p, r, and neighbors of u Could implement with API: List neighbor(Node u) int degree(Node u) Could implement with API: List neighbor(Node u) int degree(Node u) d(v) = api.degree(v) 15

Comments - Ordering might pick the largest r(u)/d(u) … or… 16

Comments – Ordering for Scanning Scan repeatedly through an adjacency-list encoding of the graph For every line you read u, v 1,…,v d(u) such that r(u)/d(u) > ε : benefit: storage is O( 1/εα) for the hash tables, avoids any seeking 17

Possible optimizations? Much faster than doing random access the first few scans, but then slower the last few …there will be only a few ‘pushes’ per scan Optimizations you might imagine: – Parallelize? – Hybrid seek/scan: Index the nodes in the graph on the first scan Start seeking when you expect too few pushes to justify a scan – Say, less than one push/megabyte of scanning – Hotspots: Save adjacency-list representation for nodes with a large r(u)/d(u) in a separate file of “hot spots” as you scan Then rescan that smaller list of “hot spots” until their score drops below threshold. 18

After computing apr… Given a graph – that’s too big for memory, and/or – that’s only accessible via API …we can extract a sample in an interesting area – Run the apr/rwr from a seed node – Sweep to find a low-conductance subset Then – compute statistics – test out some ideas – visualize it… 19

Key idea: a “sweep” Order vertices v 1, v 2, …. by highest apr(s) Pick a prefix S={ v 1, v 2, …. v k }: – S={}; volS =0; B={} – For k=1,…n: S += {v k } volS += degree(v k ) B += {u: v k  u and u not in S} Φ[k] = |B|/volS – Pick k: Φ[k] is smallest 20

Scoring a graph prefix the edges leaving S vol(S) is sum of deg(x) for x in S for small S: Prob(random edge leaves S) 21

Putting this together Given a graph – that’s too big for memory, and/or – that’s only accessible via API …we can extract a sample in an interesting area – Run the apr/rwr from a seed node – Sweep to find a low-conductance subset Then – compute statistics – test out some ideas – visualize it… 22

Visualizing a Graph with Gephi 23

Screen shots/demo Gelphi – java tool Reads several inputs –.csv file – [demo] 24

25

26

27

28

29

30

31

32

33

34