1
Purnamrita Sarkar (Carnegie Mellon), Andrew W. Moore (Google, Inc.)
2
Which nodes are most similar to node i? Examples: friend suggestion in Facebook; keyword-specific search in DBLP. [Diagram: Paper #1 and Paper #2 joined by a paper-cites-paper edge, with paper-has-word edges to words such as “SVM”, “margin”, “maximum”, “classification”, and “large scale”.]
3
Random walk based measures
- Personalized pagerank
- Hitting and commute times
- ...
Intuitive measures of similarity, successfully used for many applications.
Possible query types: find the k most relevant papers about “support vector machines”. Queries can be arbitrary.
Computing these measures at query time is still an active area of research.
4
Most algorithms [1] typically examine local neighborhoods around the query node
- High-degree nodes make them slow.
When the graph is too large for memory
- Streaming algorithms [2] require multiple passes over the entire dataset.
We want an external-memory framework that
- supports arbitrary queries
- is amenable to many random walk based measures
1. Berkhin 2006, Andersen et al. 2006, Chakrabarti 2007, Sarkar & Moore 2007
2. Das Sarma et al. 2008
5
Introduction to some measures High degree nodes Disk-resident large graphs Results
6
Personalized Pagerank (PPV)
- Start at node i
- At any step, reset to node i with probability α
- PPV is the stationary distribution of this process
Discounted hitting time
- Start at node i
- At any step, stop if you hit j, or stop with probability α
- The measure is the expected time until the walk stops
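The reset process above can be simulated directly. A minimal Monte Carlo sketch in Python (the adjacency-dict representation and all names here are illustrative, not from the talk): the visit frequencies of the α-reset walk converge to the personalized pagerank vector.

```python
import random
from collections import Counter

def ppv_monte_carlo(graph, i, alpha=0.15, n_walks=10000, rng=random):
    """Estimate personalized PageRank from node i by simulating the
    reset process: at each step the walk restarts at i with
    probability alpha. Visit frequencies approximate PPV(i, .)."""
    visits = Counter()
    for _ in range(n_walks):
        node = i
        while True:
            visits[node] += 1
            if rng.random() < alpha:        # reset: this walk segment ends
                break
            node = rng.choice(graph[node])  # step to a uniform random neighbor
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}
```

On a two-node cycle with α = 0.2, the closed form α / (1 − (1 − α)²) ≈ 0.556 for the start node agrees with the simulation.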
7
Introduction to some measures High degree nodes - Effect on personalized pagerank - Effect on discounted hitting times Disk-resident large graphs Results
8
Effect of high-degree nodes on random walks
High-degree nodes can blow up neighborhood size: bad for computational efficiency.
Real-world graphs have power-law degree distributions
- A very small number of high-degree nodes
- But they are easily reachable because of the small-world property
9
Main idea: when a random walk hits a high-degree node, only a tiny fraction of the probability mass reaches any one neighbor. So stop the random walk when it hits a high-degree node: turn the high-degree nodes into sink nodes. [Diagram: a node of degree 1000 holding probability p at step t passes only p/1000 to each neighbor at step t+1.]
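Over an adjacency-list representation, the sink transformation is a one-liner. A hedged sketch (function and parameter names are mine): drop the out-edges of any node whose degree exceeds a threshold, so walks terminate there.

```python
def make_high_degree_sinks(graph, max_degree):
    """Return a copy of the adjacency dict in which every node with
    degree above max_degree keeps no out-edges, i.e. becomes a sink.
    A random walk that reaches a sink simply stops."""
    return {u: ([] if len(nbrs) > max_degree else list(nbrs))
            for u, nbrs in graph.items()}
```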
10
We are computing personalized pagerank from node i. If we make node s into a sink, PPV(i,j) will decrease. By how much? We can prove that the lost contribution through s is Pr(hitting s from i) × PPV(s,j). Is PPV(s,j) small when s has huge degree? For undirected graphs, yes: we can show the error at any node j is bounded by a quantity that shrinks as the degree of s grows, so making only the highest-degree nodes sinks keeps the error small.
11
Discounted hitting times: hitting times with an α probability of stopping at any step. Main intuition: PPV(i,j) = Pr_α(hitting j from i) × PPV(j,j), so dividing by PPV(j,j) normalizes out node j's individual popularity. We show a similar bounded-error result for discounted hitting times when high-degree nodes are turned into sinks.
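The quantity Pr_α(hitting j from i) in the identity above can be estimated numerically. A sketch with hypothetical names: the chance that a walk from i, which terminates with probability α before each step, reaches j.

```python
import random

def hit_probability(graph, i, j, alpha=0.2, n_walks=20000, rng=random):
    """Monte Carlo estimate of Pr_alpha(hitting j from i): the
    probability that an alpha-terminated walk from i reaches j."""
    hits = 0
    for _ in range(n_walks):
        node = i
        while node != j:
            if rng.random() < alpha:       # walk stops before hitting j
                break
            node = rng.choice(graph[node])
        else:                              # loop exited with node == j
            hits += 1
    return hits / n_walks
```

Combined with a PPV estimate, this lets one check PPV(i,j) ≈ Pr_α(hit j from i) · PPV(j,j) on small graphs.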
12
Introduction to some measures High degree nodes Disk-resident large graphs Results
13
Similar nodes should be placed nearby on disk. Cluster the graph into page-size chunks:
- a random walk will stay mostly inside a good cluster
- less computational overhead
A real example follows.
14
[Figure: a Citeseer co-authorship neighborhood split into two clusters — “Robotics” (e.g. michael_beetz, howie_choset) and “Machine learning and Statistics” (e.g. tom_m_mitchell, larry_wasserman, john_langford, kamal_nigam).]
15
Top 7 nodes in personalized pagerank from Sebastian Thrun: Wolfram Burgard, Dieter Fox, Mark Craven, Kamal Nigam, Dirk Schulz, Armin Cremers, Tom Mitchell. A random walk mostly stays inside a good cluster.
16
1. Load the query node's cluster into memory.
2. Start the random walk; a page-fault is recorded every time the walk leaves the cluster and a new page must be loaded.
We can also maintain an LRU buffer to keep several clusters in memory.
Cost measure: average number of page-faults. Quality of a cluster: ratio of cross edges to the total number of edges.
Can we do better than sampling? How do we cluster an external-memory graph?
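The page-fault accounting above can be sketched as a toy simulation, assuming a precomputed node-to-cluster map and a single cluster in memory at a time (no LRU buffer); all names are illustrative.

```python
import random

def walk_page_faults(graph, cluster_of, start, steps, rng=random):
    """Simulate a random walk over a clustered graph and count a
    page-fault each time the walk enters a cluster other than the
    one currently held in memory."""
    in_memory = cluster_of[start]
    faults = 1                       # loading the starting cluster
    node = start
    for _ in range(steps):
        if not graph[node]:          # sink node: walk stops
            break
        node = rng.choice(graph[node])
        if cluster_of[node] != in_memory:
            in_memory = cluster_of[node]   # page-fault: load new cluster
            faults += 1
    return faults
```

A walk that stays inside one cluster costs a single load, which is the point of clustering the graph into page-size chunks.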
17
If we only compute PPV on the current cluster NB_j around j, we use no information from the rest of the graph: a poor approximation.
18
Instead, maintain upper and lower bounds on PPV(i,j) for each i in the current neighborhood NB_j, and add new clusters as you expand. Maintain a global upper bound ub for nodes outside; stop when ub ≤ β, since all nodes outside then have small PPV. Many fewer page-faults than sampling! We can also compute the hitting time to node j using this algorithm.
19
Pick a measure for clustering: personalized pagerank has been shown to yield good clusters [1]. Compute PPV from a set of anchor nodes A, and assign each node to its closest anchor. How to compute PPV on disk? Nodes and edges do not fit in memory, so there is no random access: this is what RWDISK provides.
1. R. Andersen, F. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS '06.
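Given PPV vectors computed from each anchor, the assignment step itself is simple. A sketch assuming the vectors are already available (e.g. produced by RWDISK); the names are mine, not the authors'.

```python
def assign_to_anchors(ppv_from_anchor, nodes):
    """Assign each node to the anchor whose personalized pagerank
    value at that node is largest (its 'closest' anchor).
    ppv_from_anchor: {anchor: {node: ppv value}}."""
    return {v: max(ppv_from_anchor,
                   key=lambda a: ppv_from_anchor[a].get(v, 0.0))
            for v in nodes}
```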
20
Compute personalized pagerank using power iterations. Each iteration is one matrix-vector multiplication, which can be computed by join operations between two lexicographically sorted files. Intermediate files can be large, so round the small probabilities to zero at each step [1]. This has bounded error, but brings the file size down from O(n²) to O(|E|).
1. Spielman and Teng 2004, Sarlos et al. 2006
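In memory, the truncated power iteration looks like the following sketch; the on-disk version replaces these dict operations with merge-joins over sorted edge and vector files. Names and defaults are mine, not from the talk.

```python
def ppv_power_iteration(graph, i, alpha=0.15, n_iter=30, eps=1e-4):
    """Approximate PPV(i, .) = alpha * sum_t (1-alpha)^t P^t(i, .)
    by power iterations, rounding probabilities below eps to zero
    after each step to keep the vectors sparse (bounded error)."""
    x = {i: 1.0}          # sparse current distribution P^t(i, .)
    ppv = {}
    weight = alpha        # alpha * (1 - alpha)^t for the current t
    for _ in range(n_iter):
        for v, p in x.items():
            ppv[v] = ppv.get(v, 0.0) + weight * p
        nxt = {}
        for v, p in x.items():
            d = len(graph[v])
            if d == 0:    # sink node: mass is absorbed
                continue
            share = p / d
            for w in graph[v]:
                nxt[w] = nxt.get(w, 0.0) + share
        x = {v: p for v, p in nxt.items() if p >= eps}  # truncate
        weight *= (1 - alpha)
    return ppv
```

With eps = 0 this is exact up to the truncated tail of the series; a positive eps trades a bounded error for much smaller intermediate vectors, mirroring the file-size reduction on disk.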
21
Introduction to some measures High degree nodes Disk-resident large graphs Results
22
Turning high-degree nodes into sinks
Deterministic algorithm vs. sampling
RWDISK on external-memory graphs: yields better clusters than METIS with a much smaller memory requirement (skipped here).
23
Citeseer subgraph: co-authorship graph
DBLP: paper-word-author graph
LiveJournal: online friendship network
24
Dataset     | Min. degree of sink nodes | Accuracy | Page-faults
Citeseer    | None                      | 0.74     | 69
Citeseer    | 100                       | 0.74     | 67
DBLP        | None                      | 0.11     | 881
DBLP        | 1000                      | 0.58     | 231
LiveJournal | None                      | 0.21     | 502
LiveJournal | 100                       | 0.43     | 255
(Slide callouts: page-faults 8 times less and 6 times less; accuracy 6 times better and 2 times better.)
25
Dataset     | Mean page-faults
Citeseer    | 6
DBLP        | 54
LiveJournal | 64
(About 10 times less than sampling for Citeseer; 4 times less for DBLP and LiveJournal.)
26
Turning high-degree nodes into sinks
- Has a bounded effect on personalized pagerank and hitting time
- Significantly improves the running time of RWDISK (3-4 times)
- Reduces the number of page-faults when sampling a random walk
- Improves link-prediction accuracy
Search algorithms on a clustered framework
- Sampling is easy to implement and can be applied widely
- The deterministic algorithm is guaranteed not to miss a potential nearest neighbor, and reduces page-faults significantly over sampling
RWDISK
- A fully external-memory algorithm for clustering a graph
27
Thanks!
28
Personalized Pagerank
- Start at node i
- At any step, reset to node i with probability α
- PPV is the stationary distribution of this process
PPV from i to j: PPV(i,j) = α Σ_t (1-α)^t P^t(i,j)
29
Maintain ub(NB_j), an upper bound on PPV for all nodes outside NB_j, and expand; stop when ub ≤ α. Guaranteed to return all nodes with PPV(i,j) ≥ α. Many fewer page-faults than sampling! We can also compute PPV to node j using this algorithm.
30
Dataset     | Min. degree of sink nodes | Number of sinks | Time
DBLP        | None                      | 0               | ≥ 2.5 days
DBLP        | 1000                      | 900             | 11 hours
LiveJournal | 1000                      | 950             | 60 hours
LiveJournal | 100                       | 134K            | 17 hours
(About 4 times faster on DBLP and 3 times faster on LiveJournal.)
31
[Figure: histograms of the expected number of page-faults when a random walk steps outside a cluster, for a randomly picked node. Panels, left to right: Citeseer, DBLP, LiveJournal.]