
Overview of this week
Debugging tips for ML algorithms
Graph algorithms
– A prototypical graph algorithm: PageRank
  – In memory
  – Putting more and more on disk …
– Sampling from a graph
  – What is a good sample? (graph statistics)
  – What methods work? (PPR/RWR)
HW: PageRank-Nibble method + Gephi

Common statistics for graphs
William Cohen

[Table: statistics computed for example graphs – number of nodes, number of edges, average degree, diameter, coefficient of the degree curve, clustering coefficient (homophily)]

An important question
How do you explore a dataset?
– compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …)
– sample and inspect
– run a bunch of small-scale experiments
How do you explore a graph?
– compute statistics (degree distribution, …)
– sample and inspect
  – but how do you sample?

Overview of this week
Debugging tips for ML algorithms
Graph algorithms
– A prototypical graph algorithm: PageRank
  – In memory
  – Putting more and more on disk …
– Sampling from a graph
  – What is a good sample? (graph statistics)
  – What sampling methods work? (PPR/RWR)
HW: PageRank-Nibble method + Gephi

[Slide: title page of “Sampling from Large Graphs”, Jure Leskovec & Christos Faloutsos, KDD 2006]

Brief summary
Define goals of sampling:
– “scale-down”: find G′ < G with similar statistics
– “back in time”: for a growing G, find G′ < G that is similar (statistically) to an earlier version of G
Experiment on real graphs with plausible sampling methods, such as
– RN – random nodes, sampled uniformly
– …
See how well they perform

Brief summary
Experiment on real graphs with plausible sampling methods, such as
– RN – random nodes, sampled uniformly
– RPN – random nodes, sampled by PageRank
– RDP – random nodes, sampled by in-degree
– RE – random edges
– RJ – run PageRank’s “random surfer” for n steps
– RW – run RWR’s “random surfer” for n steps
– FF – repeatedly pick r(i) neighbors of i to “burn”, and then recursively sample from them
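
As a concrete illustration, here is a minimal sketch of the RW method in Python; the function name and the adjacency-dict graph representation are my assumptions, not the paper’s:

import random

def rw_sample(adj, seed, n_steps, restart=0.15):
    """RW sampling: run RWR's "random surfer" from a seed node for
    n_steps and keep every node it visits."""
    visited, cur = {seed}, seed
    for _ in range(n_steps):
        if random.random() < restart or not adj[cur]:
            cur = seed                     # restart (also escapes dead ends)
        else:
            cur = random.choice(adj[cur])  # walk to a random neighbor
        visited.add(cur)
    return visited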

RWR/Personalized PageRank vs PR
PageRank update: let v_{t+1} = c·u + (1−c)·W v_t, where u is the uniform vector
Personalized PR/RWR update: let v_{t+1} = c·s + (1−c)·W v_t
– s is the seed vector or personalization vector
– in RN it’s just a random unit vector
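
To make both updates concrete, a minimal power-iteration sketch (NumPy; the dense column-stochastic W, the function name, and the example values are all illustrative assumptions):

import numpy as np

def power_iterate(W, c, s, iters=100):
    """Iterate v <- c*s + (1-c)*W@v. With s uniform this is PageRank;
    with s a one-hot seed vector it is personalized PR / RWR."""
    v = s.copy()
    for _ in range(iters):
        v = c * s + (1 - c) * (W @ v)
    return v

# W[i, j] = Pr(step to node i | at node j): columns sum to 1
W = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
print(power_iterate(W, 0.15, np.ones(3) / 3))          # PageRank
print(power_iterate(W, 0.15, np.array([1., 0., 0.])))  # PPR/RWR from node 0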

10% sample – pooled over five datasets
[Figure: comparison of the sampling methods at 10% sample size]

The D-statistic measures agreement between two distributions: D = max_x |F(x) − F′(x)|, where F and F′ are CDFs (the Kolmogorov–Smirnov statistic); the max is taken over nine different graph statistics.
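
For example, a minimal computation of D from two empirical samples (the same statistic scipy.stats.ks_2samp reports; the function name is mine):

import numpy as np

def d_statistic(xs, ys):
    """D = max_x |F(x) - F'(x)| over the empirical CDFs of xs and ys."""
    grid = np.sort(np.concatenate([xs, ys]))
    F  = np.searchsorted(np.sort(xs), grid, side="right") / len(xs)
    Fp = np.searchsorted(np.sort(ys), grid, side="right") / len(ys)
    return np.abs(F - Fp).max()

print(d_statistic(np.random.rand(1000), np.random.rand(1000)))  # small D: same distribution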

Goal
An efficient way of running RWR on a large graph
– one that uses only “random access”: you can ask for the neighbors of a node, but you can’t scan through the graph (a common situation with APIs)
– this leads to a plausible sampling strategy:
  – Jure & Christos’s experiments
  – some formal results that justify it…

[Slide: title page of “Local Graph Partitioning using PageRank Vectors”, Andersen, Chung & Lang, FOCS 2006]

What is Local Graph Partitioning?
[Figures: a sequence of slides contrasting global partitioning, which cuts the entire graph into pieces, with local partitioning, which finds one good cluster around a seed node while touching only part of the graph]

Key idea: a “sweep”
Order all vertices in some way: v_{i,1}, v_{i,2}, …
– say, by personalized PageRank from a seed
Pick a prefix v_{i,1}, v_{i,2}, …, v_{i,k} that is “best”
– …

What is a “good” subgraph?
ϕ(S) = (number of edges leaving S) / vol(S)
– vol(S) is the sum of deg(x) for x in S
– for small S: ϕ(S) is Prob(a random edge from S leaves S)

Key idea: a “sweep”
Order all vertices in some way: v_{i,1}, v_{i,2}, …
– say, by personalized PageRank from a seed
Pick a prefix S = { v_{i,1}, v_{i,2}, …, v_{i,k} } that is “best”
– minimal “conductance” ϕ(S)
You can re-compute conductance incrementally as you add a new vertex, so the sweep is fast
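
A minimal sketch of the sweep, assuming an undirected adjacency dict and PPR scores; names are mine. I use the general min(vol(S), vol(rest)) denominator rather than the small-S simplification so the trivial “everything” set never wins; the incremental cut/volume update is what makes the sweep fast:

def sweep(adj, scores):
    """Sweep vertices by decreasing degree-normalized score, tracking
    conductance phi(S) = cut(S) / min(vol(S), vol(rest)) incrementally."""
    total_vol = sum(len(adj[v]) for v in adj)
    order = sorted(scores, key=lambda v: -scores[v] / len(adj[v]))
    in_S, cut, vol = set(), 0, 0
    best_phi, best_k = float("inf"), 0
    for k, v in enumerate(order, 1):
        inside = sum(1 for w in adj[v] if w in in_S)
        cut += len(adj[v]) - 2 * inside   # new boundary edges minus newly-internal ones
        vol += len(adj[v])
        in_S.add(v)
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_k = cut / denom, k
    return set(order[:best_k]), best_phi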

Main results of the paper
1. An approximate personalized PageRank computation that only touches nodes “near” the seed
– but has small error relative to the true PageRank vector
2. A proof that a sweep over the approximate PageRank vector finds a cut with conductance sqrt(α ln m)
– unless no good cut exists: no subset S contains significantly more mass in the approximate PageRank vector than under a uniform distribution

Result 2 explains Jure & Christos’s experimental results with RW sampling:
– RW approximately picks up a random subcommunity (maybe with some extra nodes)
– features like clustering coefficient and degree should be representative of the graph as a whole… which is roughly a mixture of subcommunities

Main results of the paper
1. An approximate personalized PageRank computation that only touches nodes “near” the seed
– but has small error relative to the true PageRank vector
This is a very useful technique to know about…

Random Walks: avoiding messy “dead ends”…

Random Walks: PageRank
[Figures: the PageRank update written out as equations]

Flashback: Zeno’s paradox
Lance Armstrong and the tortoise have a race
– Lance is 10x faster
– the tortoise has a 1m head start at time 0
So, when Lance gets to 1m, the tortoise is at 1.1m
So, when Lance gets to 1.1m, the tortoise is at 1.11m …
So, when Lance gets to 1.11m, the tortoise is at 1.111m …
… and Lance will never catch up?
1 + 0.1 + 0.01 + 0.001 + … = ? (unresolved until calculus was invented)

Zeno: pwned by telescoping sums
Let x be less than 1. Then (1 + x + x^2 + … + x^n)(1 − x) = 1 − x^{n+1}, so 1 + x + x^2 + … = 1/(1 − x)
Example: x = 0.1, and 1 + 0.1 + 0.01 + … = 1/(1 − 0.1) = 10/9

Graph = Matrix; Vector = Node → Weight
[Figure: a small graph on nodes A–J shown alongside its adjacency matrix M and a node-weight vector v (A→3, B→2, C→3, …)]

Racing through a graph?
Let W[i,j] be Pr(walk to j from i) and let α be less than 1. Then, by the same telescoping argument:
I + αW + (αW)^2 + (αW)^3 + … = (I − αW)^{-1}
The matrix (I − αW) is the Laplacian of αW. Generally the Laplacian is (D − A), where D[i,i] is the degree of i in the adjacency matrix A.
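
A quick numerical check of that identity (illustrative values; W here is just a random row-stochastic matrix):

import numpy as np

np.random.seed(0)
W = np.random.rand(5, 5)
W /= W.sum(axis=1, keepdims=True)      # rows sum to 1: W[i,j] = Pr(j | i)
alpha = 0.8

series = sum(np.linalg.matrix_power(alpha * W, t) for t in range(200))
inverse = np.linalg.inv(np.eye(5) - alpha * W)
print(np.allclose(series, inverse))    # True: I + aW + (aW)^2 + ... = (I - aW)^-1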

Random Walks: PageRank
[Figure: the PageRank vector written as a weighted sum over walks of increasing length]

Approximate PageRank: Key Idea
By definition PageRank is the fixed point of: pr(α, s) = α s + (1 − α) pr(α, s) W
Claim: pr(α, s) = α s + (1 − α) pr(α, sW)
Proof: define a matrix for the pr operator: R_α s = pr(α, s)

Approximate PageRank: Key Idea (continued)
Proof (continued): expand pr as a power series, pr(α, s) = α Σ_{t≥0} (1 − α)^t s W^t, peel off the t = 0 term (which is α s), and re-index the sum: what remains is (1 − α) pr(α, sW).

Approximate PageRank: Key Idea
By definition PageRank is the fixed point of: pr(α, s) = α s + (1 − α) pr(α, s) W
Claim: pr(α, s) = α s + (1 − α) pr(α, sW)
Key idea in apr: do this “recursive step” repeatedly, focusing on nodes where finding PageRank from neighbors will be useful
– recursively compute the PageRank of the “neighbors of s” (= sW), then adjust

Approximate PageRank: Key Idea
– p is the current approximation (start at 0)
– r is the set of “recursive calls to make” – the residual error; start with all mass on s
– u is the node picked for the next call

Analysis
By linearity: pr(α, r) = pr(α, r − r(u)χ_u) + pr(α, r(u)χ_u)
Apply the claim to the second term, then re-group using linearity:
pr(α, r − r(u)χ_u) + (1 − α) pr(α, r(u)χ_u W) = pr(α, r − r(u)χ_u + (1 − α) r(u)χ_u W)
So each push moves α r(u) of probability mass onto p(u) and spreads the remaining (1 − α) r(u) over u’s neighbors in r.

Approximate PageRank: Algorithm
[Slide: pseudocode for the push-based apr algorithm; a sketch follows below]
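
A minimal sketch of that push loop in Python (the non-lazy variant, matching the analysis on the surrounding slides; function and parameter names are mine, and it assumes no zero-degree nodes):

from collections import defaultdict, deque

def approx_pagerank(seed, alpha, eps, neighbors, degree):
    """Push-based approximate personalized PageRank.
    Maintains the invariant p + pr(alpha, r) = pr(alpha, chi_seed)."""
    p, r = defaultdict(float), defaultdict(float)
    r[seed] = 1.0
    todo = deque([seed])
    while todo:
        u = todo.popleft()
        if r[u] < eps * degree(u):       # re-check: u may already have been pushed
            continue
        p[u] += alpha * r[u]             # keep an alpha fraction of the residual at u
        share = (1 - alpha) * r[u] / degree(u)
        r[u] = 0.0
        for v in neighbors(u):           # spread the rest over u's neighbors
            r[v] += share
            if r[v] >= eps * degree(v):
                todo.append(v)
    return p, r                          # p approximates pr; r is the residual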

Analysis
So, at every point in the apr algorithm: p + pr(α, r) = pr(α, s)
Also, each push decreases |r|_1 by at least α·ε·degree(u). So after T push operations, where the i-th pushed node has degree d_i, we know α·ε·Σ_{i=1..T} d_i ≤ 1, which bounds the size of r and p.

Analysis
With the invariant p + pr(α, r) = pr(α, s), and r(u) < ε·degree(u) for every u at termination, this bounds the error of p relative to the true PageRank vector.

Comments – API
– p, r are hash tables – they are small: O(1/εα) entries
– push just needs p, r, and the neighbors of u
– could implement with the API:
  List neighbor(Node u)
  int degree(Node u)
  so d(v) = api.degree(v)

Comments - Ordering
– might pick the u with the largest r(u)/d(u) … or …

Comments – Ordering for Scanning
Scan repeatedly through an adjacency-list encoding of the graph:
– for every line you read – u, v_1, …, v_{d(u)} – such that r(u)/d(u) > ε: do a push
– benefit: storage is O(1/εα) for the hash tables, and this avoids any seeking
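
A sketch of one such scan, assuming each file line is “u v_1 … v_{d(u)}” (whitespace-separated node IDs; the file name and seed in the usage note are placeholders). Repeat passes until a pass performs no pushes:

def scan_pass(path, alpha, eps, p, r):
    """One sequential pass over an adjacency-list file, pushing every
    node whose residual is over threshold; returns the number of pushes."""
    pushes = 0
    with open(path) as f:
        for line in f:
            u, *nbrs = line.split()
            d = len(nbrs)
            if d == 0 or r.get(u, 0.0) / d <= eps:
                continue
            p[u] = p.get(u, 0.0) + alpha * r[u]    # same push as before
            share = (1 - alpha) * r[u] / d
            r[u] = 0.0
            for v in nbrs:
                r[v] = r.get(v, 0.0) + share
            pushes += 1
    return pushes

# usage: p, r = {}, {"seed_node": 1.0}
#        while scan_pass("graph.adj", 0.15, 1e-6, p, r): pass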

Possible optimizations?
Scanning is much faster than doing random access for the first few scans, but slower for the last few, when there will be only a few “pushes” per scan.
Optimizations you might imagine:
– Parallelize?
– Hybrid seek/scan:
  – index the nodes in the graph on the first scan
  – start seeking when you expect too few pushes to justify a scan
    – say, less than one push per megabyte of scanning
– Hotspots:
  – save the adjacency-list representation for nodes with a large r(u)/d(u) in a separate file of “hot spots” as you scan
  – then rescan that smaller list of “hot spots” until their scores drop below threshold

Putting this together
Given a graph
– that’s too big for memory, and/or
– that’s only accessible via an API
…we can extract a sample in an interesting area:
– run apr/RWR from a seed node
– sweep to find a low-conductance subset
Then:
– compute statistics
– test out some ideas
– visualize it…
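
Tying the sketches above together on a toy graph (everything here is illustrative; it reuses approx_pagerank and sweep from the earlier sketches):

# toy graph standing in for a big / API-only one: two triangles joined by an edge
G = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
     "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}

p, _ = approx_pagerank(seed="a", alpha=0.15, eps=1e-4,
                       neighbors=lambda u: G[u], degree=lambda u: len(G[u]))
adj = {u: G[u] for u in p}   # materialize only the region apr touched
S, phi = sweep(adj, p)       # low-conductance subset around the seed
print(S, phi)                # then: compute statistics, experiment, visualize in Gephi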