1 Fast Incremental Proximity Search in Large Graphs Purnamrita Sarkar Andrew W. Moore Amit Prakash.

Slides:



Advertisements
Similar presentations
1 Random Walks on Graphs: An Overview Purnamrita Sarkar.
Advertisements

Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.)
Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
Fast Algorithms for Top-k Personalized PageRank Queries Manish Gupta Amit Pathak Dr. Soumen Chakrabarti IIT Bombay.
Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Microsoft Research.
Nonparametric Link Prediction in Dynamic Graphs Purnamrita Sarkar (UC Berkeley) Deepayan Chakrabarti (Facebook) Michael Jordan (UC Berkeley) 1.
Information Networks Link Analysis Ranking Lecture 8.
Maximizing the Spread of Influence through a Social Network
Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements Raju Balakrishnan (Arizona State University)
Information Networks Small World Networks Lecture 5.
CS 599: Social Media Analysis University of Southern California1 The Basics of Network Analysis Kristina Lerman University of Southern California.
Absorbing Random walks Coverage
The influence of search engines on preferential attachment Dan Li CS3150 Spring 2006.
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
Analysis of Network Diffusion and Distributed Network Algorithms Rajmohan Rajaraman Northeastern University, Boston May 2012 Chennai Network Optimization.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Fast Query Execution for Retrieval Models based on Path Constrained Random Walks Ni Lao, William W. Cohen Carnegie Mellon University
N EIGHBORHOOD F ORMATION AND A NOMALY D ETECTION IN B IPARTITE G RAPHS Jimeng Sun, Huiming Qu, Deepayan Chakrabarti & Christos Faloutsos Jimeng Sun, Huiming.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Graphical Models for Protein Kinetics Nina Singhal CS374 Presentation Nov. 1, 2005.
Geographic Gossip: Efficient Aggregations for Sensor Networks Author: Alex Dimakis, Anand Sarwate, Martin Wainwright University: UC Berkeley Venue: IPSN.
Link Analysis, PageRank and Search Engines on the Web
Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
Purnamrita Sarkar (UC Berkeley) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.) 1.
1 Fast Dynamic Reranking in Large Graphs Purnamrita Sarkar Andrew Moore.
The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.
Retrieval Evaluation. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
1 Fast Algorithms for Proximity Search on Large Graphs Purnamrita Sarkar Machine Learning Department Carnegie Mellon University.
On Distinguishing between Internet Power Law B Bu and Towsley Infocom 2002 Presented by.
1 Uniform Sampling from the Web via Random Walks Ziv Bar-Yossef Alexander Berg Steve Chien Jittat Fakcharoenphol Dror Weitz University of California at.
Fast Algorithms for Top-k Personalized PageRank Queries Manish Gupta Amit Pathak Dr. Soumen Chakrabarti IIT Bombay.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.)
DATA MINING LECTURE 13 Absorbing Random walks Coverage.
Network Characterization via Random Walks B. Ribeiro, D. Towsley UMass-Amherst.
1 Random Walks on Graphs: An Overview Purnamrita Sarkar, CMU Shortened and modified by Longin Jan Latecki.
Influence Maximization in Dynamic Social Networks Honglei Zhuang, Yihan Sun, Jie Tang, Jialin Zhang, Xiaoming Sun.
DATA MINING LECTURE 13 Pagerank, Absorbing Random Walks Coverage Problems.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
On Node Classification in Dynamic Content-based Networks.
Personalized Social Recommendations – Accurate or Private? A. Machanavajjhala (Yahoo!), with A. Korolova (Stanford), A. Das Sarma (Google) 1.
Xiaowei Ying, Xintao Wu Univ. of North Carolina at Charlotte PAKDD-09 April 28, Bangkok, Thailand On Link Privacy in Randomizing Social Networks.
NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.
Purnamrita Sarkar Committee: Andrew W. Moore (Chair) Geoffrey J. Gordon Anupam Gupta Jon Kleinberg (Cornell) 1.
1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.
Query Sensitive Embeddings Vassilis Athitsos, Marios Hadjieleftheriou, George Kollios, Stan Sclaroff.
Ranking Link-based Ranking (2° generation) Reading 21.
Link Prediction Topics in Data Mining Fall 2015 Bruno Ribeiro
Purnamrita Sarkar Committee: Andrew W. Moore (Chair) Geoffrey J. Gordon Anupam Gupta Jon Kleinberg (Cornell) 1.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
RoundTripRank Graph-based Proximity with Importance and Specificity Yuan FangUniv. of Illinois at Urbana-Champaign Kevin C.-C. ChangUniv. of Illinois at.
Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.
Single-Pass Belief Propagation
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
Purnamrita Sarkar (Carnegie Mellon) Andrew W. Moore (Google, Inc.)
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
Hiroki Sayama NECSI Summer School 2008 Week 2: Complex Systems Modeling and Networks Network Models Hiroki Sayama
A Theoretical Justification of Link Prediction Heuristics
Uniform Sampling from the Web via Random Walks
Sublinear Algorithms for Personalized PageRank, with Applications
A Theoretical Justification of Link Prediction Heuristics
Theoretical Justification of Popular Link Prediction Heuristics
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Department of Computer Science University of York
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Scaling up Link Prediction with Ensembles
Asymmetric Transitivity Preserving Graph Embedding
Learning to Rank Typed Graph Walks: Local and Global Approaches
Presentation transcript:

1 Fast Incremental Proximity Search in Large Graphs Purnamrita Sarkar Andrew W. Moore Amit Prakash

Recommender systems Brand, M. (2005). A Random Walks Perspective on Maximizing Satisfaction and Profit. SIAM '05. Alice Bob Charlie What are the top k movie recommendations for Alice?

Content-based search in databases {1,2} 1. Dynamic personalized pagerank in entity-relation graphs. (Soumen Chakrabarti, WWW 2007) 2. Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. VLDB Paper #2 Paper #1 SVM margin maximum classification paper-has-word paper-cites-paper paper-has-word large scale Find top k papers matching “SVM” in Citeseer

Proximity measures on graphs For ranking we need proximity measures  Personalized pagerank  Hitting/commute times  Truncated hitting and commute times We will present an algorithm to compute nearest neighbors in truncated hitting and commute times on the fly. Empirically show that these truncated measures perform better than the others in most cases.

Outline Ranking Problem on graphs Background  Random walks  Standard and truncated measures  Previous work Our contribution  Theoretical justification  Algorithm sketch Experimental results

Random walks Start at

Random walks Start at 1 Pick a neighbor i uniformly at random Move to i t=1

Random walks Start at 1 Pick a neighbor i uniformly at random Move to i Continue t=1 t=2

Random walks Start at 1 Pick a neighbor i uniformly at random Move to i Continue t=1 t=2 t=3 If the random walk hits a node many times, then its close to the start node! Hitting time!

Hitting time from i to j i h(i,j)=? j Graph G

Hitting time from i to j ij n1n1 n2n2 P(i,n 1 ) P(i,n 2 ) Graph G h(i,j)=1+…

Hitting time from i to j ij h(i,j)=1+P(i,n 1 )*h(n 1,j)+P(i,n 2 )*h(n 2,j) n1n1 n2n2 P(i,n 1 ) P(i,n 2 ) h(n 1,j) h(n 2,j) Graph G

Hitting time can be defined recursively Symmetric version: commute time Random walk-based proximity measures Aldous, D., & Fill, J. A. (2001). Reversible markov chains.

1.Liben-Nowell, D., & Kleinberg, J. The link prediction problem for social networks CIKM '03. 2.Brand, M. (2005). A Random Walks Perspective on Maximizing Satisfaction and Profit. SIAM '05. Hitting & Commute times: Drawbacks Predictive power Hitting time and Commute times often take into account very long paths. 1,2 You are more prone to pick up popular entities. 1,2  Bad for personalization.  Alice likes cartoons, so her top 10 recommendations should not be the 10 most popular movies. As a result these do not perform well for link prediction 1,2.

A better proximity measure based on truncated random walks We use a truncated version of hitting and commute times, which only considers paths of length at most T Sarkar, P., & Moore, A. (2007). A tractable approach to finding closest truncated-commute-time neighbors in large graphs. Proc. UAI.

15 nearest neighbors of node 95 (in red) Un-truncated hitting time Truncated hitting time Un-truncated vs truncated hitting time from a node Random walk gets lost here.

Truncated Hitting & Commute times For small T  Are not sensitive to long paths  Do not necessarily favor high degree nodes These are also easier to compute compared to un-truncated hitting and commute times Important for personalized search!

Nearest neighbors of a node: Computational complexity of truncated measures Hitting time to node j H T [i,j]=h T (i,j) Can compute using Dynamic programming in O(T*num_edges) Hitting time from node i Have to fill up entire matrix! O(num_nodes*T*num_edges)!! Previous Work : << O(num_nodes 2 )

Outline Ranking Problem on graphs Background  Random walks  Standard and truncated measures  Previous work Proposed work  Theoretical justification  Algorithm sketch Experimental results

hitting time to node j GRANCH: computing hitting time to a node H T [i,j]=h T (i,j) Use dynamic programming to fill up interesting patches in the column Sarkar, P., & Moore, A. (2007). A tractable approach to finding closest truncated-commute-time neighbors in large graphs. Proc. UAI.

GRANCH: computing hitting time from a node H T [i,j]=h T (i,j) Fill up all columns

hitting time from node i GRANCH: computing hitting time from a node H T [i,j]=h T (i,j)

GRANCH: Pros & Cons Pros:  The amortized time and space per node is small  Great for computing nearest neighbors for all nodes! Cons:  Large pre-processing time: Looks at entire graph  Significant space required: caches neighborhoods for all nodes What if the graph changes between two consecutive queries?  Need fast, on the fly, space-efficient search algorithms!

Outline Ranking Problem on graphs Background  Random walks  Standard and truncated measures  Previous work Proposed work  Theoretical justification  Algorithm sketch Experimental results

Current Work Proposed work Return k approximate nearest neighbors of the query node in truncated commute time without looking at entire graph.   DP Sampling Previous Work: GRANCH Hitting time TOHitting time FROM

Proposed work: sketch Return k approximate nearest neighbors of the query node with truncated commute time ≤ 2  Theoretical justification: number of nodes within 2  truncated commute time is small.  #nodes with hitting time ≤  from the query is small  #nodes with hitting time ≤  to the query is small We present an algorithm which adaptively finds a neighborhood which contains all nodes with commute time ≤ 2  w.h.p. Then we perform ranking on these nodes

Outline Ranking Problem on graphs Background  Random walks  Standard and truncated measures  Previous work Proposed work  Theoretical justification  Algorithm sketch Experimental results

Number of nodes with hitting time ≤  from the query G What is the size of the set ? How many nodes will I hit within τ time?

Number of nodes with hitting time ≤  to the query G How many nodes will hit me within τ time? What is the size of the set ? Directed graphs: Not too many nodes will have a lot of neighbors within a small hitting time to them.

G What is the size of the set ? Stronger guarantee for undirected graphs! How many nodes will hit me within τ time? Number of nodes with hitting time ≤  to the query

Outline Ranking Problem on graphs Background  Random walks  Standard and truncated measures  Previous work Proposed work  Theoretical justification  Algorithm sketch Experimental results

Algorithm sketch : Combining sampling and dynamic programming Use sampling to estimate hitting time from node i Maintain a neighborhood NB(i) around node i Use estimated hitting times and dynamic programming to compute bounds on commute time between node i and nodes in NB(i). As we expand the neighborhood these bounds get tighter. Rank using bounds on commute times.

Sampling for computing hitting time from a node T=5

Sampling for computing hitting time from a node T=5

Sampling for computing hitting time from a node T=5

Sampling for computing hitting time from a node T=5

Sampling for computing hitting time from a node T=5

Sampling for computing hitting time from a node h(1,5) = 1 h(1,9) = 2 h(1,7) = 4 h(1,-) = 5 T=5

Define X ij as the first hitting time at j If the random walk never hits j, X(i,j) = T Sampling for computing hitting time from a node

Sample complexity bounds Estimating h(i,j), j=1,…,n [Theorem]: The number of samples needed to achieve low error with high probability is proportional to log(n). Retrieving top k neighbors [Theorem]: Number of samples needed to retrieve top k neighbors with high probability is small when the gap between the true k th and k+1 th neighbor is large.

Outline Ranking Problem on graphs Background  Random walks  Standard and truncated measures  Previous work Proposed work  Theoretical justification  Algorithm sketch Experimental results

Citeseer graph: Average query processing time Find k nearest neighbors of a query node in truncated hitting/commute times 628,000 nodes. 2.8 Million edges on a single CPU machine.  Sampling (7,500 samples) 0.7 seconds  Exact truncated commute time: 88 seconds  Hybrid commute time: 4 seconds

Keyword-author-citation graphs Dynamic personalized pagerank in entity-relation graphs. (Chakrabarti, S., WWW 2007) Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. VLDB words papers authors Existing work use Personalized Pagerank (PPV). We present quantifiable link prediction tasks We compare PPV with truncated hitting and commute times.

Word Task words papers authors

Word Task words papers authors Rank the papers for these words. See if the paper comes up in top 1,3,5,…

Author Task words papers authors Rank the papers for these authors. See if the paper comes up in top 1,3,5,…

Word Task k Fraction of queries with the held out paper in top k

Author Task k Fraction of queries with the held out paper in top k

Conclusion On-the-fly algorithm to compute approximate nearest neighbors in truncated hitting and commute time. If the graph changes between consecutive queries, our algorithm is fast. On most quantifiable link prediction tasks our measures outperform personalized pagerank. On a single CPU machine, we can process queries in 4 seconds on average in a graph with 600,000 nodes and 3 million edges.

Acknowledgement Soumen Chakrabarti Anupam Gupta Geoffrey Gordon Tamás Sarlós Martin Zinkevich

Conclusion On-the-fly algorithm to compute approximate nearest neighbors in truncated hitting and commute time. If the graph changes between consecutive queries, our algorithm is fast. On most quantifiable link prediction tasks our measures outperform personalized pagerank. On a single CPU machine, we can process queries in 4 seconds on average in a graph with 600,000 nodes and 3 million edges.

Thank you

Hitting & Commute times: Drawbacks Computational complexity Recent “efficient” approximation algorithm for computing commute times in “undirected graphs ”1. For directed graphs, these measures are hard to compute. For many real world applications  Underlying graphs are directed  We need fast incremental algorithms for computing nearest neighbors of a query. 1. Spielman, D., & Srivastava, N. (2008). Graph sparsification by effective resistances. STOC'08

Approximate nearest neighbor Given , k,  for any node i, find k other nodes y within truncated commute time 2 , s.t. c T iy · c T ix (1+  ) where x is the true k-th-nearest neighbor.

Personalized pagerank Personalized pagerank for node i  Start from node i  At any timestep jump to node i with probability c  Stop when the distribution does not change.  Solve for v such that  r is a distribution

Properties of Truncated Hitting & Commute times For small T:  Are not sensitive to long paths  Do not favor high degree nodes For a randomly generated undirected graph, average correlation coefficient (R avg ) with the degree-sequence is  R avg with truncated hitting time is  R avg with untruncated hitting time is Important for personalized search!