Purnamrita Sarkar (Carnegie Mellon) Andrew W. Moore (Google, Inc.)

Presentation transcript:

Purnamrita Sarkar (Carnegie Mellon) Andrew W. Moore (Google, Inc.)

Which nodes are most similar to node i? Examples: friend suggestion in Facebook; keyword-specific search in DBLP. [Figure: a graph of paper nodes (Paper #1, Paper #2) and word nodes (SVM, margin, maximum, classification, large, scale), linked by paper-has-word and paper-cites-paper edges.]

Random walk based measures: personalized pagerank, hitting and commute times, ... These are intuitive measures of similarity and have been successfully used for many applications. Possible query types: find the k most relevant papers about “support vector machines”. Queries can be arbitrary. Computing these measures at query time is still an active area of research.

Most algorithms [1] typically examine local neighborhoods around the query node; high degree nodes make them slow. When the graph is too large for memory, streaming algorithms [2] require multiple passes over the entire dataset. We want an external memory framework that supports arbitrary queries and is amenable to many random walk based measures. [1] Berkhin 2006, Andersen et al. 2006, Chakrabarti 2007, Sarkar & Moore. [2] Das Sarma et al. 2008.

Introduction to some measures. High degree nodes. Disk-resident large graphs. Results.

Personalized Pagerank (PPV): start at node i; at any step, reset to node i with probability α. PPV is the stationary distribution of this process. Discounted Hitting Time: start at node i; at any step, stop if you hit j, or stop with probability α. The discounted hitting time is the expected time to stop.
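The two definitions above can be sketched in a few lines of Python. This is only an illustrative in-memory version, not the method from the talk: the function names, the adjacency-list input, and the fixed iteration count are assumptions, and the recursion used for the discounted hitting time (h(j,j) = 0, h(i,j) = 1 + (1-α) Σ_k P(i,k) h(k,j)) is one common convention that the slide does not spell out.

```python
def transition_probs(adj):
    """Uniform random-walk probabilities P(u, v) = 1 / degree(u).
    Assumes every node has at least one neighbor."""
    return {u: {v: 1.0 / len(nbrs) for v in nbrs} for u, nbrs in adj.items()}


def personalized_pagerank(adj, i, alpha=0.15, iters=200):
    """Power iteration for the walk that resets to node i with probability alpha."""
    P = transition_probs(adj)
    ppv = {u: 0.0 for u in adj}
    ppv[i] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        nxt[i] = alpha  # restart mass returns to the query node
        for u, mass in ppv.items():
            for v, p in P[u].items():
                nxt[v] += (1 - alpha) * mass * p
        ppv = nxt
    return ppv


def discounted_hitting_time(adj, j, alpha=0.15, iters=200):
    """Expected time to stop for every start node, with target j.
    Assumed convention: h(j) = 0, h(i) = 1 + (1 - alpha) * sum_k P(i, k) * h(k)."""
    P = transition_probs(adj)
    h = {u: 0.0 for u in adj}
    for _ in range(iters):
        h = {
            u: 0.0 if u == j
            else 1.0 + (1 - alpha) * sum(p * h[v] for v, p in P[u].items())
            for u in adj
        }
    return h


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
ppv_from_0 = personalized_pagerank(adj, 0)
time_to_3 = discounted_hitting_time(adj, 3)
```

Under this convention the discounted hitting time can never exceed 1/α, which is the value it takes when j is unreachable.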

Introduction to some measures. High degree nodes: effect on personalized pagerank; effect on discounted hitting times. Disk-resident large graphs. Results.

Effect of high-degree nodes on random walks: high degree nodes can blow up the neighborhood size, which is bad for computational efficiency. Real world graphs have a power law degree distribution: there is a very small number of high degree nodes, but they are easily reachable because of the small world property.

Main idea: when a random walk hits a high degree node, only a tiny fraction of the probability mass gets to each of its neighbors. So stop the random walk when it hits a high degree node, i.e. turn the high degree nodes into sink nodes. [Figure: a node of degree 1000 holds probability p at step t; at step t+1 each neighbor receives p/1000.]
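A minimal sketch of the sink idea, assuming an in-memory adjacency list and a hypothetical degree threshold. The helper names (`make_sinks`, `ppv_series`, `hub_graph`) are illustrative and do not come from the talk. PPV is computed here as the truncated series α Σ_t (1-α)^t P^t(i,·), so that mass reaching a sink simply stops moving.

```python
def make_sinks(adj, min_degree):
    """Copy of adj in which nodes of degree >= min_degree lose their out-edges.
    Incoming edges are kept, so walks can still reach a sink and stop there."""
    return {u: ([] if len(nbrs) >= min_degree else list(nbrs))
            for u, nbrs in adj.items()}


def ppv_series(adj, i, alpha=0.15, steps=200):
    """PPV(i, .) as alpha * sum_t (1 - alpha)^t P^t(i, .), truncated after `steps` terms.
    A node with an empty neighbor list acts as a sink."""
    walk = {u: 0.0 for u in adj}
    walk[i] = 1.0
    ppv = {u: 0.0 for u in adj}
    weight = alpha
    for _ in range(steps):
        for u, mass in walk.items():
            ppv[u] += weight * mass
        nxt = {u: 0.0 for u in adj}
        for u, mass in walk.items():
            if mass and adj[u]:
                share = mass / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
        walk = nxt
        weight *= 1 - alpha
    return ppv


def hub_graph(n_leaves=20):
    """Hypothetical example: triangle 0-1-2, plus hub 3 linked to node 2 and to many leaves."""
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    for leaf in range(4, 4 + n_leaves):
        adj[3].append(leaf)
        adj[leaf] = [3]
    return adj


adj = hub_graph()
full = ppv_series(adj, 0)
with_sinks = ppv_series(make_sinks(adj, min_degree=10), 0)
loss_at_node_1 = full[1] - with_sinks[1]  # small: little mass returns through the hub
```

In this toy graph the hub passes only 1/21 of its mass back toward the triangle at each step, so cutting its out-edges changes the scores near the query node only slightly while removing all the leaves from the computation.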

We are computing personalized pagerank from node i. If we make node s into a sink, PPV(i,j) will decrease. By how much? Can prove: the contribution through s is (probability of hitting s from i) * PPV(s,j). Is PPV(s,j) small if s has huge degree? For undirected graphs, can show that the error at any node j is ≤ [formula image not included].

Discounted hitting times: hitting times with an α probability of stopping at any step. Main intuition: PPV(i,j) = Pr_α(hitting j from i) * PPV(j,j). Individual popularity is normalized out. We show [formula image not included].
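The identity PPV(i,j) = Pr_α(hitting j from i) * PPV(j,j) can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are made up, the walk moves uniformly over neighbors, and Pr_α is taken to be the probability of reaching j before the walk stops, with the walk surviving each step with probability 1-α.

```python
def ppv_from(adj, i, alpha=0.15, iters=300):
    """PPV(i, .) via the fixed point v = alpha * e_i + (1 - alpha) * P^T v."""
    v = {u: 0.0 for u in adj}
    v[i] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        nxt[i] = alpha
        for u, mass in v.items():
            for w in adj[u]:
                nxt[w] += (1 - alpha) * mass / len(adj[u])
        v = nxt
    return v


def discounted_hit_prob(adj, j, alpha=0.15, iters=300):
    """f(i) = probability of reaching j from i before the walk stops.
    f(j) = 1 and f(i) = (1 - alpha) * average of f over the neighbors of i."""
    f = {u: 0.0 for u in adj}
    f[j] = 1.0
    for _ in range(iters):
        f = {
            u: 1.0 if u == j
            else (1 - alpha) * sum(f[w] for w in adj[u]) / len(adj[u])
            for u in adj
        }
    return f


def identity_gap(adj, j, alpha=0.15):
    """Largest |PPV(i, j) - f(i) * PPV(j, j)| over all start nodes i."""
    f = discounted_hit_prob(adj, j, alpha)
    ppv_jj = ppv_from(adj, j, alpha)[j]
    return max(abs(ppv_from(adj, i, alpha)[j] - f[i] * ppv_jj) for i in adj)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
gap = identity_gap(adj, 3)  # expected to be at the level of floating-point error
```

Dividing PPV(i,j) by PPV(j,j) leaves only the hitting probability, which is one way to read the remark that individual popularity is normalized out.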

Introduction to some measures. High degree nodes. Disk-resident large graphs. Results.

Similar nodes should be placed nearby on disk. Cluster the graph into page-size chunks: a random walk will stay mostly inside a good cluster, which means less computational overhead. A real example:

[Figure: two clusters of author nodes, labeled "Robotics" and "Machine learning and Statistics". Node labels: david_apfelbauu, thomas_hoffmann, kurt_kou, daurel_, michael_beetz, larry_wasserman, john_langford, kamal_nigam, michael_ krell, tom_m_mitchell, howie_choset.]

Top 7 nodes in personalized pagerank from Sebastian Thrun: Wolfram Burgard, Dieter Fox, Mark Craven, Kamal Nigam, Dirk Schulz, Armin Cremers, Tom Mitchell. A random walk mostly stays inside a good cluster.

1. Load cluster in memory. 2. Start random walk. There is a page-fault every time the walk leaves the cluster; can also maintain an LRU buffer to store the clusters in memory. A page-fault is recorded when a new page is loaded. The average number of page-faults corresponds to the ratio of cross edges to the total number of edges, which in turn reflects the quality of a cluster. Can we do better than sampling? How to cluster an external memory graph?
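The sampling procedure and its page-fault count can be simulated in memory. This is a rough sketch under assumptions the slide does not fix: one cluster equals one page, the first load of the start cluster is not counted as a fault, each sampled walk stops with probability α per step, and all function names and the two-clique test graph are hypothetical.

```python
import random
from collections import OrderedDict


def mean_page_faults(adj, cluster_of, start, alpha=0.15,
                     buffer_size=1, n_walks=2000, seed=0):
    """Average number of cluster loads per sampled walk, using an LRU buffer."""
    rng = random.Random(seed)
    faults = 0
    for _ in range(n_walks):
        buffer = OrderedDict({cluster_of[start]: True})  # initial load, not counted
        node = start
        while rng.random() >= alpha:  # continue with probability 1 - alpha
            node = rng.choice(adj[node])
            c = cluster_of[node]
            if c in buffer:
                buffer.move_to_end(c)  # mark as most recently used
            else:
                faults += 1  # a new page has to be loaded
                buffer[c] = True
                if len(buffer) > buffer_size:
                    buffer.popitem(last=False)  # evict the least recently used
    return faults / n_walks


def cross_edge_ratio(adj, cluster_of):
    """Fraction of (directed) edges whose endpoints lie in different clusters."""
    total = sum(len(nbrs) for nbrs in adj.values())
    cross = sum(1 for u, nbrs in adj.items() for v in nbrs
                if cluster_of[u] != cluster_of[v])
    return cross / total


def two_cliques(size=5):
    """Two cliques of `size` nodes joined by a single edge."""
    adj = {}
    for base in (0, size):
        for u in range(base, base + size):
            adj[u] = [v for v in range(base, base + size) if v != u]
    adj[size - 1].append(size)
    adj[size].append(size - 1)
    return adj


adj = two_cliques()
good = {u: u // 5 for u in adj}  # one cluster per clique
bad = {u: u % 2 for u in adj}    # clusters that ignore the graph structure
faults_good = mean_page_faults(adj, good, start=0)
faults_bad = mean_page_faults(adj, bad, start=0)
```

Comparing `faults_good` with `faults_bad`, and `cross_edge_ratio` for the two assignments, shows how a clustering with few cross edges keeps the walk on pages that are already loaded.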

Only compute PPV on the current cluster: no information from the rest of the graph, so a poor approximation. [Figure: node j and its neighborhood NB_j, with the rest of the graph unknown.]

Compute upper and lower bounds on PPV(i,j) for i in NB_j. Add new clusters when you expand. Maintain a global upper bound ub for nodes outside. Stop when ub ≤ β, at which point all nodes outside have small PPV. Many fewer page-faults than sampling! We can also compute hitting time to node j using this algorithm. [Figure: neighborhood NB_j around node j, expanded while the outside bound ub is tracked.]

Pick a measure for clustering. Personalized pagerank has been shown to yield good clustering [1]. Compute PPV from a set of A anchor nodes, and assign a node to its closest anchor. How to compute it on disk? Personalized pagerank on disk: nodes/edges do not fit in memory, so there is no random access; hence RWDISK. [1] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In FOCS '06.

Compute personalized pagerank using power iterations. Each iteration = one matrix-vector multiplication, which can be computed by join operations between two lexicographically sorted files. Intermediate files can be large, so round the small probabilities to zero at any step [1]. This has bounded error, but brings down the file size from O(n^2) to O(|E|). [1] Spielman and Teng 2004, Sarlos et al. 2006.
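The join-based iteration can be imitated in memory with sorted lists standing in for the sorted files. This is not the RWDISK implementation: the row layouts, the function names, the rounding threshold `eps`, the in-memory accumulator, and the "closest anchor = largest PPV" rule are all assumptions made for illustration.

```python
from itertools import groupby


def edge_file(adj):
    """Sorted (src, dst, probability) rows, standing in for an on-disk edge file."""
    return sorted((u, v, 1.0 / len(nbrs)) for u, nbrs in adj.items() for v in nbrs)


def join_step(edges, mass, alpha, eps):
    """One matrix-vector product as a merge join of two sorted row lists.
    `mass` rows are (node, anchor, value); the result has the same layout."""
    out, e, m = [], 0, 0
    while e < len(edges) and m < len(mass):
        src, node = edges[e][0], mass[m][0]
        if src < node:
            e += 1
        elif src > node:
            m += 1
        else:
            e_end = e
            while e_end < len(edges) and edges[e_end][0] == src:
                e_end += 1
            m_end = m
            while m_end < len(mass) and mass[m_end][0] == node:
                m_end += 1
            for _, dst, p in edges[e:e_end]:
                for _, anchor, value in mass[m:m_end]:
                    out.append((dst, anchor, (1 - alpha) * p * value))
            e, m = e_end, m_end
    out.sort()  # re-sort by (dst, anchor) so equal keys become adjacent
    merged = []
    for (dst, anchor), rows in groupby(out, key=lambda r: (r[0], r[1])):
        total = sum(r[2] for r in rows)
        if total >= eps:  # round small probabilities to zero
            merged.append((dst, anchor, total))
    return merged


def ppv_from_anchors(adj, anchors, alpha=0.15, steps=200, eps=1e-4):
    """Approximate PPV(anchor, node) for every anchor, keyed by (anchor, node)."""
    edges = edge_file(adj)
    mass = sorted((a, a, 1.0) for a in anchors)
    ppv = {}  # kept in memory here; a disk-based version would use another file
    for _ in range(steps):
        if not mass:
            break
        for node, anchor, value in mass:
            ppv[(anchor, node)] = ppv.get((anchor, node), 0.0) + alpha * value
        mass = join_step(edges, mass, alpha, eps)
    return ppv


def assign_to_anchors(ppv):
    """Assign each node to the anchor with the largest PPV (assumed meaning of 'closest')."""
    best = {}
    for (anchor, node), value in ppv.items():
        if node not in best or value > best[node][1]:
            best[node] = (anchor, value)
    return {node: anchor for node, (anchor, _) in best.items()}


# Hypothetical toy graph: a triangle 0-1-2 followed by a path 2-3-4-5.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clusters = assign_to_anchors(ppv_from_anchors(adj, anchors=[0, 5]))
```

Because dropped mass is never re-added, every rounded score is at most the exact score, which is the direction of error the slide's bounded-error remark is concerned with.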

Introduction to some measures. High degree nodes. Disk-resident large graphs. Results.

Turning high degree nodes into sinks. Deterministic algorithm vs. sampling. RWDISK on external memory graphs: yields better clusters than METIS with much less memory requirement (will skip for now).

Citeseer subgraph: co-authorship graphs. DBLP: paper-word-author graphs. LiveJournal: online friendship network.

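[Table: columns are Dataset, Minimum degree of sink nodes, Accuracy, Page-faults. Rows cover Citeseer, DBLP and LiveJournal, each with a "None" (no sink nodes) baseline; the numeric entries are not included in the transcript. Annotations: "... times less", "6 times less", "6 times better", "2 times better".]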

Mean page-faults per dataset: Citeseer 6, DBLP 54, LiveJournal 64. Annotations: 10 times less than sampling; 4 times less than sampling.

Turning high degree nodes into sinks: has bounded effect on personalized pagerank and hitting time; significantly improves the time of RWDISK (3-4 times); improves the number of page-faults in sampling a random walk; improves link prediction accuracy. Search algorithms on a clustered framework: sampling is easy to implement and can be applied widely; the deterministic algorithm is guaranteed to not miss a potential nearest neighbor and improves the number of page-faults significantly over sampling. RWDISK: a fully external memory algorithm for clustering a graph.

Thanks!

Personalized Pagerank: start at node i; at any step, reset to node i with probability α. PPV is the stationary distribution of this process. The PPV from i to j is PPV(i,j) = α Σ_t (1-α)^t P^t(i,j).

Maintain ub(NB_j) for all nodes outside NB_j. Stop when ub ≤ α. Guaranteed to return all nodes with PPV(i,j) ≥ α. Many fewer page-faults than sampling! We can also compute PPV to node j using this algorithm. [Figure: neighborhood NB_j around node j with the outside bound ub(NB_j), expanded step by step.]

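[Table: columns are Dataset, Sink nodes (minimum degree of a sink node, number of sinks), Time. Recoverable entries: DBLP with no sink nodes (0 sinks) takes ≥ 2.5 days; one LiveJournal row takes 17 hours; the remaining entries are not included in the transcript. Annotations: 4 times faster; 3 times faster.]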

Histograms of the expected number of page-faults if a random walk stepped outside a cluster, for a randomly picked node. Left to right, the panels are for Citeseer, DBLP and LiveJournal.