
Graph Algorithms - continued William Cohen

Outline
Last week:
– PageRank – one algorithm on graphs
  - edges and nodes in memory
  - nodes in memory
  - nothing in memory
– (Semi)supervised learning on graphs
– Properties of (social) graphs
– Joey Gonzalez guest lecture: GraphLab

Recap: Why I’m talking about graphs
Lots of large data is graphs
– Facebook, Twitter, citation data, and other social networks
– The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks
– Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks
  - nodes = documents or words
  - links connect document → word or word → document
– Computer networks, biological networks (proteins, ecosystems, brains, …), …
– Heterogeneous networks with multiple types of nodes
  - people, groups, documents

Today’s question
How do you explore a dataset?
– compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …)
– sample and inspect
  - run a bunch of small-scale experiments
How do you explore a graph?
– compute statistics (degree distribution, …)
– sample and inspect
  - how do you sample?

Today’s important question
How do you explore a graph?
– compute statistics (degree distribution, …)
– sample and inspect
  - how do you sample?
This is the next assignment – sampling from a graph

Degree distribution
Plot cumulative degree:
– X axis is degree k
– Y axis is #nodes that have degree at least k
Typically use a log-log scale:
– straight lines indicate a power law; a normal curve dives to zero at some point
– Left: trust network in the Epinions web site, from Richardson & Domingos
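As a concrete illustration (my sketch, not lecture code), the quantity plotted on the Y axis – the number of nodes with degree at least k – can be computed from an edge list like this; `ccdf_degrees` is a hypothetical helper name:

```python
from collections import Counter

def ccdf_degrees(edges):
    """Return {k: number of nodes with degree >= k} for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # count nodes at each exact degree, then accumulate from the top down
    at = Counter(deg.values())
    ccdf, running = {}, 0
    for k in sorted(at, reverse=True):
        running += at[k]
        ccdf[k] = running
    return ccdf

# A small star graph: the hub has degree 4, each leaf degree 1.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
print(ccdf_degrees(edges))  # {4: 1, 1: 5}
```

Plotting the resulting (k, count) pairs on log-log axes gives the cumulative degree plot described above.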

Homophily Another def’n: excess edges between common neighbors of v

An important question
How do you explore a dataset?
– compute statistics (e.g., feature histograms, conditional feature histograms, correlation coefficients, …)
– sample and inspect
  - run a bunch of small-scale experiments
How do you explore a graph?
– compute statistics (degree distribution, …)
– sample and inspect
  - how do you sample?

KDD 2006: Leskovec & Faloutsos, “Sampling from Large Graphs”

Brief summary
Define goals of sampling:
– “scale-down” – find G’ < G with similar statistics
– “back in time” – for a growing G, find G’ < G that is statistically similar to an earlier version of G
Experiment on real graphs with plausible sampling methods, such as:
– RN – random nodes, sampled uniformly
– …
See how well they perform

Brief summary
Experiment on real graphs with plausible sampling methods, such as:
– RN – random nodes, sampled uniformly
  - RPN – random nodes, sampled by PageRank
  - RDP – random nodes, sampled by in-degree
– RE – random edges
– RJ – run PageRank’s “random surfer” for n steps
– RW – run RWR’s “random surfer” for n steps
– FF – repeatedly pick r(i) neighbors of i to “burn”, and then recursively sample from them
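To make two of these concrete, here is a minimal sketch of RN (random nodes plus their induced edges) and FF (forest fire). The function names and the burn probability `p` are illustrative choices, not taken from the paper:

```python
import random

def random_node_sample(nodes, edges, frac, seed=0):
    """RN: keep a uniform fraction of nodes, plus the edges they induce."""
    rng = random.Random(seed)
    keep = set(rng.sample(sorted(nodes), int(frac * len(nodes))))
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]

def forest_fire_sample(adj, seed_node, n_target, p=0.7, seed=0):
    """FF sketch: repeatedly 'burn' a random subset of each burned node's
    unvisited neighbors (here via a BFS queue) until n_target nodes burn."""
    rng = random.Random(seed)
    burned, queue = {seed_node}, [seed_node]
    while queue and len(burned) < n_target:
        u = queue.pop(0)
        for v in adj.get(u, []):
            if v not in burned and rng.random() < p and len(burned) < n_target:
                burned.add(v)
                queue.append(v)
    return burned
```

RW- and RJ-style samples would instead follow a walk matrix with restarts, as in the PageRank material later in the lecture.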

10% sample – pooled on five datasets

d-statistic measures agreement between distributions
– D = max_x |F(x) − F’(x)|, where F, F’ are CDFs
– take the max over nine different statistics
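The D here is the Kolmogorov–Smirnov distance between two empirical CDFs. A small sketch of how it could be computed (my code, not the paper's):

```python
def d_statistic(xs, ys):
    """Kolmogorov-Smirnov D: max |F(x) - F'(x)| over the pooled sample points."""
    def cdf(sample):
        s = sorted(sample)
        return lambda x: sum(1 for v in s if v <= x) / len(s)
    F, G = cdf(xs), cdf(ys)
    return max(abs(F(x) - G(x)) for x in sorted(set(xs) | set(ys)))

print(d_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 for identical samples
print(d_statistic([1, 1, 1, 1], [2, 2, 2, 2]))  # 1.0 for disjoint supports
```

D ranges from 0 (identical distributions) to 1 (completely separated), which is why it is a convenient single-number score for comparing a sample's statistics to the full graph's.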

Goal
An efficient way of running RWR on a large graph
– can use only “random access”
  - you can ask about the neighbors of a node, but you can’t scan through the graph
  - a common situation with APIs
– leads to a plausible sampling strategy
  - Jure & Christos’s experiments
  - some formal results that justify it…

FOCS 2006: Andersen, Chung & Lang, “Local Graph Partitioning using PageRank Vectors”

What is Local Graph Partitioning? Global vs. local


Key idea: a “sweep”
Order all vertices in some way v_{i,1}, v_{i,2}, …
– say, by personalized PageRank from a seed
Pick a prefix v_{i,1}, v_{i,2}, …, v_{i,k} that is “best”
– …

What is a “good” subgraph?
– conductance: ϕ(S) = (# edges leaving S) / vol(S)
– vol(S) is the sum of deg(x) for x in S
– for small S: ϕ(S) ≈ Prob(a random edge incident to S leaves S)

Key idea: a “sweep”
Order all vertices in some way v_{i,1}, v_{i,2}, …
– say, by personalized PageRank from a seed
Pick the prefix S = { v_{i,1}, v_{i,2}, …, v_{i,k} } that is “best”
– i.e., has minimal conductance ϕ(S)
You can re-compute conductance incrementally as you add a new vertex, so the sweep is fast
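A sweep with incremental conductance updates can be sketched as follows (my sketch, assuming an undirected graph stored as an adjacency dict; for the full vertex set the cut is 0, so in practice one only considers prefixes with at most half the total volume):

```python
def sweep(adj, order):
    """Conductance phi(S) = (#edges leaving S) / vol(S) for each prefix S of
    `order`, where vol(S) = sum of degrees in S. Updated incrementally:
    adding v costs O(deg(v)), so the whole sweep is O(#edges)."""
    in_S, cut, vol, phis = set(), 0, 0, []
    for v in order:
        vol += len(adj[v])
        for w in adj[v]:
            cut += -1 if w in in_S else 1  # edge (v,w) flips into/out of the cut
        in_S.add(v)
        phis.append(cut / vol)
    return phis

# Two triangles joined by a single bridge edge (c, d):
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
      "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
phis = sweep(adj, ["a", "b", "c", "d", "e", "f"])
# Best proper prefix is {a, b, c}: one edge leaves it, vol = 7, phi = 1/7.
```

With an ordering by personalized PageRank from a seed inside one triangle, the minimum-conductance prefix recovers that triangle as the local cluster.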

Main results of the paper
1. An approximate personalized PageRank computation that only touches nodes “near” the seed
– but has small error relative to the true PageRank vector
2. A proof that a sweep over the approximate PageRank vector finds a cut with conductance sqrt(α ln m)
– unless no good cut exists
  - no subset S contains significantly more mass in the approximate PageRank than in a uniform distribution

Result 2 explains Jure & Christos’s experimental results with RW sampling:
– RW approximately picks up a random subcommunity (maybe with some extra nodes)
– features like clustering coefficient and degree should be representative of the graph as a whole…
– …which is roughly a mixture of subcommunities

Main results of the paper
1. An approximate personalized PageRank computation that only touches nodes “near” the seed
– but has small error relative to the true PageRank vector
This is a very useful technique to know about…

Random walks avoid messy “dead ends”…

Random Walks: PageRank

Flashback: Zeno’s paradox
Lance Armstrong and the tortoise have a race
– Lance is 10x faster
– the tortoise has a 1m head start at time 0
So, when Lance gets to 1m, the tortoise is at 1.1m
So, when Lance gets to 1.1m, the tortoise is at 1.11m
So, when Lance gets to 1.11m, the tortoise is at 1.111m
…and Lance will never catch up?
1 + 0.1 + 0.01 + 0.001 + … = ?
– unresolved until calculus was invented

Zeno: pwned by telescoping sums
Let x be less than 1. Then
1 + x + x² + x³ + … = 1/(1 − x)
Example: x = 0.1, and 1 + 0.1 + 0.01 + … = 1/0.9 = 10/9.
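As a quick numerical sanity check (not from the slides), the partial sums do approach 1/(1 − x):

```python
def geometric_sum(x, n_terms):
    """Partial sum 1 + x + x^2 + ... + x^(n_terms - 1)."""
    return sum(x ** k for k in range(n_terms))

# For x = 0.1 the partial sums approach 1/(1 - 0.1) = 10/9 = 1.111...
print(geometric_sum(0.1, 30))
print(10 / 9)
```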

Graph = Matrix; Vector = Node → Weight
[Figure: a 10-node graph on nodes A–J, its adjacency matrix M (M[i,j] = 1 iff edge (i,j) is present), and a vector v assigning each node a weight.]

Racing through a graph?
Let W[i,j] be Pr(walk to j from i), and let α be less than 1. Then:
I + αW + (αW)² + (αW)³ + … = (I − αW)⁻¹
The matrix (I − αW) is the Laplacian of αW. Generally the Laplacian is (D − A), where D[i,i] is the degree of i in the adjacency matrix A.
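The same telescoping-sum idea works for matrices: the series I + αW + (αW)² + … converges to (I − αW)⁻¹ when α < 1 and W is a row-stochastic walk matrix. A small sketch with hand-rolled 2×2 matrix arithmetic (an assumed toy example, not lecture code):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_series(W, alpha, n_terms):
    """Approximate (I - alpha*W)^(-1) as I + (aW) + (aW)^2 + ..."""
    n = len(W)
    aW = [[alpha * W[i][j] for j in range(n)] for i in range(n)]
    total = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    power = [row[:] for row in total]
    for _ in range(n_terms):
        power = matmul(power, aW)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Row-stochastic W for a 2-node graph with an edge each way:
W = [[0.0, 1.0], [1.0, 0.0]]
S = neumann_series(W, 0.5, 60)
# (I - 0.5*W)^(-1) for this W is [[4/3, 2/3], [2/3, 4/3]]; S converges to it.
```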

Random Walks: PageRank

Approximate PageRank: Key Idea
By definition, personalized PageRank is the fixed point of:
pr(α, s) = α·s + (1 − α)·pr(α, s)·W
Claim:
pr(α, s) = α·s + (1 − α)·pr(α, sW)
Proof: plug in the right-hand side of the Claim for pr(α, s) and check that it satisfies the fixed-point equation, using linearity of pr(α, s) in s.

Approximate PageRank: Key Idea
By definition, personalized PageRank is the fixed point of:
pr(α, s) = α·s + (1 − α)·pr(α, s)·W
Claim:
pr(α, s) = α·s + (1 − α)·pr(α, sW)
Key idea in apr: do this “recursive step” repeatedly
– focus on nodes where finding PageRank from neighbors will be useful
– recursively compute PageRank of the “neighbors of s” (= sW), then adjust

Approximate PageRank: Key Idea
– p is the current approximation (start at 0)
– r is the set of “recursive calls to make” – the residual error (start with all mass on s)
– u is the node picked for the next call

Analysis
By linearity: pr(α, r) = pr(α, r − r(u)·χ_u) + pr(α, r(u)·χ_u)
By the Claim, pr(α, r(u)·χ_u) = α·r(u)·χ_u + (1 − α)·pr(α, r(u)·χ_u·W); the first term is added to p, and re-grouping the rest by linearity:
pr(α, r − r(u)·χ_u) + (1 − α)·pr(α, r(u)·χ_u·W) = pr(α, r − r(u)·χ_u + (1 − α)·r(u)·χ_u·W)
which is pr applied to the new residual r.

Approximate PageRank: Algorithm
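The pseudocode on this slide is not in the transcript; the following is my reconstruction of the Andersen–Chung–Lang push loop it describes (assuming an undirected graph with no degree-0 nodes; the names p, r, u match the slides):

```python
def approx_pagerank(adj, s, alpha=0.15, eps=1e-4):
    """ACL-style apr: maintain approximation p and residual r so that
    pr(alpha, chi_s) = p + pr(alpha, r). A push at u moves alpha*r(u) into
    p(u) and spreads (1-alpha)*r(u) over u's neighbors. Stops when
    r(u) < eps*d(u) for every u, touching only nodes near the seed s."""
    p, r = {}, {s: 1.0}
    todo = [s]
    while todo:
        u = todo.pop()
        ru, d = r.get(u, 0.0), len(adj[u])
        if ru < eps * d:
            continue  # entry may be stale; the pop-time check keeps it correct
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = 0.0
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + (1 - alpha) * ru / d
            if r[v] >= eps * len(adj[v]):
                todo.append(v)
    return p, r

# Triangle graph seeded at "a": p concentrates near the seed.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
p, r = approx_pagerank(adj, "a", alpha=0.2, eps=1e-6)
```

Each push conserves total mass between p and r, which is exactly the invariant used in the Analysis slides.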

Analysis
So, at every point in the apr algorithm:
pr(α, χ_s) = p + pr(α, r)
Also, at each push |r|_1 decreases by at least α·ε·degree(u), so after T push operations where degree(i-th u) = d_i, we know:
α·ε·(d_1 + … + d_T) ≤ 1, i.e., Σ_i d_i ≤ 1/(ε·α)
which bounds the size of r and p.

Analysis
With the invariant pr(α, χ_s) = p + pr(α, r), and r(v) < ε·d(v) for every v at termination:
this bounds the error of p relative to the true PageRank vector.

Comments – API
– p, r are hash tables – they are small: O(1/(ε·α)) entries
– push just needs p, r, and the neighbors of u
– could implement with an API:
  List neighbor(Node u)
  int degree(Node u)
– d(v) = api.degree(v)

Comments – Ordering
might pick the largest r(u)/d(u)… or…

Comments – Ordering for Scanning
Scan repeatedly through an adjacency-list encoding of the graph
For every line you read – u, v_1, …, v_{d(u)} – do a push at u if r(u)/d(u) > ε
Benefit: storage is O(1/(ε·α)) for the hash tables, and scanning avoids any seeking
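A sketch of this scanning variant (my code; the file format "u v1 v2 …", one node's adjacency list per line, is an assumption, and `io.StringIO` stands in for the on-disk file):

```python
import io

def scan_pushes(graph_file, p, r, alpha, eps):
    """One sequential scan: for each adjacency line 'u v1 v2 ...', push at u
    if r(u)/d(u) > eps. Only p and r (small hash tables) stay in memory."""
    pushes = 0
    for line in graph_file:
        u, *neighbors = line.split()
        d = len(neighbors)
        ru = r.get(u, 0.0)
        if d and ru / d > eps:
            p[u] = p.get(u, 0.0) + alpha * ru
            r[u] = 0.0
            for v in neighbors:
                r[v] = r.get(v, 0.0) + (1 - alpha) * ru / d
            pushes += 1
    return pushes

# Adjacency-list encoding of a triangle, scanned like a file on disk;
# rescan until a full pass makes no pushes.
data = "a b c\nb a c\nc a b\n"
p, r = {}, {"a": 1.0}
while scan_pushes(io.StringIO(data), p, r, alpha=0.15, eps=0.01):
    pass
```

The loop terminates because every push removes at least α·ε·d(u) of residual mass, exactly as in the analysis above.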

Possible optimizations?
Much faster than doing random access for the first few scans, but slower for the last few
– there will be only a few “pushes” per scan
Optimizations you might imagine:
– Parallelize?
– Hybrid seek/scan:
  - index the nodes in the graph on the first scan
  - start seeking when you expect too few pushes to justify a scan
    – say, less than one push per megabyte of scanning
– Hotspots:
  - save the adjacency-list representation for nodes with a large r(u)/d(u) in a separate file of “hot spots” as you scan
  - then rescan that smaller list of “hot spots” until their scores drop below threshold

Putting this together
Given a graph
– that’s too big for memory, and/or
– that’s only accessible via an API
…we can extract a sample in an interesting area:
– run apr/RWR from a seed node
– sweep to find a low-conductance subset
Then:
– compute statistics
– test out some ideas
– visualize it…

Screen shots/demo
Gephi – a Java tool
Reads several inputs:
– .csv file