Fast Dynamic Reranking in Large Graphs. Purnamrita Sarkar, Andrew Moore.


1 Fast Dynamic Reranking in Large Graphs. Purnamrita Sarkar, Andrew Moore

2 Talk Outline
- Ranking in graphs
- Reranking in graphs
- Harmonic functions for reranking
- Efficient algorithms
- Results

3 Graphs are everywhere
- The world wide web
- Publications: CiteSeer, DBLP
- Friendship networks: Facebook
Find webpages related to 'CMU'; find papers related to the word 'SVM' in DBLP; find other people similar to 'Purna'. All are search problems in graphs.

4 Graph Search: the underlying question
Given a query node, return k other nodes which are most similar to it. We need a graph-theoretic measure of similarity:
- minimum number of hops (not robust enough)
- average number of hops (huge number of paths!)
- probability of reaching a node in a random walk

5 Graph Search: the underlying technique
Pick a favorite graph-based proximity measure and output the top k nodes:
- Personalized PageRank (Jeh & Widom, 2003)
- Hitting and commute times (Aldous & Fill)
- SimRank (Jeh & Widom, 2002)
- Fast random walk with restart (Tong & Faloutsos, 2006)
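As an illustration of the first measure, here is a minimal Personalized PageRank sketch: power iteration for a random walk that restarts at the query node. The transition matrix `P`, the restart probability `alpha`, and the toy 3-node path graph are assumptions made for the example, not details from the talk.

```python
import numpy as np

def personalized_pagerank(P, query, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank: a random walk that,
    at each step, restarts at the query node with probability alpha.
    P is the row-stochastic transition matrix of the graph."""
    n = P.shape[0]
    restart = np.zeros(n)
    restart[query] = 1.0
    pi = restart.copy()
    for _ in range(iters):
        pi = alpha * restart + (1 - alpha) * (P.T @ pi)
    return pi  # pi[j]: stationary probability of the walk being at node j

# Example: a 3-node path graph 0 - 1 - 2, query at node 0
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
scores = personalized_pagerank(P, query=0)
```

Nodes close to the query accumulate more probability mass, so sorting `scores` gives a query-specific ranking.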

6 Talk Outline
- Ranking in graphs
- Reranking in graphs
- Harmonic functions for reranking
- Efficient algorithms
- Results

7 Why do we need reranking?
Search algorithms use the query node and the graph structure. The results are often unsatisfactory: the query may be ambiguous (e.g., 'mouse'), or the user does not know the right keyword. User feedback then yields a reranked list. Current techniques (Jin et al., 2008) are too slow for this particular problem setting; we propose fast algorithms that use random walks to rerank search results quickly.

8 What is reranking?
- User submits a query to the search engine.
- Search engine returns the top k results: p of them are relevant, n are irrelevant, and the user isn't sure about the rest.
- Produce a new list such that relevant results are at the top and irrelevant ones are at the bottom.

9 Reranking as semi-supervised learning
Given a graph and a small set of labeled nodes, learn a function f that classifies all other nodes. We want f to be smooth over the graph, i.e. a node classified as positive is "near" the positive labeled nodes and "further away" from the negative labeled nodes. Harmonic functions!

10 Talk Outline
- Ranking in graphs
- Reranking in graphs
- Harmonic functions for reranking
- Efficient algorithms
- Results

11 Harmonic functions: applications
- Image segmentation (Grady, 2006)
- Automated image colorization (Levin et al., 2004)
- Web spam classification (Joshi et al., 2007)
- Classification (Zhu et al., 2003)

12 Harmonic functions in graphs
Fix the function value at the labeled nodes and compute the values of the other nodes. The function value at a node is the average of the function values of its neighbors:
f(i) = Σ_j P(i, j) f(j),
where f(i) is the function value at node i and P(i, j) is the probability of moving from i to j in one step.

13 Harmonic function on a graph
Can be computed by solving a linear system; not a good idea if the labeled set is changing quickly.
- f(i,1) = probability of hitting a 1 before a 0
- f(i,0) = probability of hitting a 0 before a 1
If the graph is strongly connected we have f(i,1) + f(i,0) = 1.
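For a small graph the linear system can be solved directly. A minimal sketch (the node indexing, transition matrix, and 3-node example graph are assumptions for illustration):

```python
import numpy as np

def harmonic(P, labels):
    """Solve f(i,1) = sum_j P(i,j) f(j,1) for unlabeled i, with labeled
    nodes clamped to their labels. f(i,1) is then the probability of
    hitting a 1-labeled node before a 0-labeled one.
    P: row-stochastic transition matrix; labels: {node: 0 or 1}."""
    n = P.shape[0]
    L = sorted(labels)                             # labeled nodes
    U = [i for i in range(n) if i not in labels]   # unlabeled nodes
    fL = np.array([float(labels[i]) for i in L])
    # f_U = (I - P_UU)^{-1} P_UL f_L
    fU = np.linalg.solve(np.eye(len(U)) - P[np.ix_(U, U)],
                         P[np.ix_(U, L)] @ fL)
    f = np.zeros(n)
    f[L], f[U] = fL, fU
    return f

# 3-node path 0 - 1 - 2: node 0 labeled 0, node 2 labeled 1
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
f = harmonic(P, {0: 0, 2: 1})  # node 1 is equidistant from both labels
```

The direct solve is exactly what the slide warns about: fine once, but too expensive to redo every time the label set changes.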

14 T-step variant of a harmonic function
f_T(i,1) = probability of hitting a node labeled 1 before a node labeled 0 within T steps, so f_T(i,1) + f_T(i,0) ≤ 1. A simple classification rule: node i is class '1' if f_T(i,1) ≥ f_T(i,0). We want to use the information from negative labels more.
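The T-step variant needs no linear solve: T rounds of neighbor averaging with the labeled nodes clamped. A sketch (the example graph and choice of T are illustrative assumptions):

```python
import numpy as np

def t_step_harmonic(P, labels, T):
    """f_T(i,1): probability of hitting a 1-labeled node before a
    0-labeled node within T steps. f_0 is the label indicator;
    each round averages neighbor values and re-clamps the labels."""
    n = P.shape[0]
    f = np.zeros(n)
    for i, y in labels.items():
        f[i] = float(y)
    for _ in range(T):
        f = P @ f
        for i, y in labels.items():
            f[i] = float(y)   # labeled nodes stay fixed
    return f

# 3-node path 0 - 1 - 2: node 0 labeled 0, node 2 labeled 1
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
fT = t_step_harmonic(P, {0: 0, 2: 1}, T=5)
```

Running the same recursion with the labels flipped gives f_T(i,0), and the two together feed the classification rule above.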

15 Conditional probability
Condition on the event that the walk hits some label. The conditional probability at i is
f_T(i,1) / (f_T(i,1) + f_T(i,0)),
i.e. the probability of hitting a 1 before a 0 in T steps divided by the probability of hitting some label in T steps. It has no ranking information when f_T(i,1) = 0.

16 Smoothed conditional probability
If we assume equal priors on the two classes, the smoothed version fixes this: when f_T(i,1) = 0, the smoothed function still uses f_T(i,0) for ranking.
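A sketch of both ranking scores. The additive-smoothing form below is an assumption chosen to match the behavior the slide describes (equal priors; when f_T(i,1) = 0 the score still decreases with f_T(i,0)) — it is not necessarily the paper's exact formula.

```python
def conditional(f1, f0):
    """P(hit a 1 before a 0 within T steps | hit some label within T steps).
    Undefined (no ranking information) when f1 + f0 = 0."""
    return f1 / (f1 + f0) if f1 + f0 > 0 else 0.5

def smoothed_conditional(f1, f0, eps=0.01):
    """ASSUMED smoothing: add equal prior mass eps to each class.
    When f1 = 0 the score eps / (f0 + 2*eps) still ranks by f0."""
    return (f1 + eps) / (f1 + f0 + 2 * eps)
```

With f1 = 0 the plain conditional score is 0 for every node, while the smoothed score is larger for nodes with smaller f0, restoring a usable ranking.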

17 A toy example
A 200-node graph with 2 clusters, 260 edges, and 30 inter-cluster edges. We compute the AUC score for T = 5 and T = 10 with 20 labeled nodes, varying the number of positive labels from 1 to 19 and averaging the AUC score over 10 random runs for each configuration.

18 Toy example: results
- For T = 10 all measures perform well.
- Unconditional becomes better as the number of positives increases.
- Conditional is good when the classes are balanced.
- Smoothed conditional always works well.
(Figure: AUC score, higher is better, vs. number of positive labels.)

19 Talk Outline
- Ranking in graphs
- Reranking in graphs
- Harmonic functions for reranking
- Efficient algorithms
- Results

20 Two application scenarios
1. Rank a subset of the nodes in the graph.
2. Rank all the nodes in the graph.

21 Application scenario #1
The user enters a query, the search engine generates a ranklist, and the user enters relevance feedback. There is reason to believe that the top 100 ranked nodes are the most relevant, so rank only those nodes.

22 Sampling algorithm for scenario #1
Given a set of candidate nodes, sample M paths from each node. A path ends when it reaches length T or when it hits a labeled node. We can compute estimates of the harmonic function from these samples; with "enough" samples these estimates get "close to" the true values.
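A minimal Monte Carlo sketch of this sampling scheme (the adjacency-list representation, seed, and toy path graph are assumptions for the example):

```python
import random

def sample_estimate(neighbors, labels, start, T, M, seed=0):
    """Estimate f_T(start,1) and f_T(start,0) from M random walks.
    Each walk takes at most T steps and stops early at a labeled node."""
    rng = random.Random(seed)
    hits = [0, 0]
    for _ in range(M):
        node = start
        for _ in range(T):
            if node in labels:
                break
            node = rng.choice(neighbors[node])
        if node in labels:
            hits[labels[node]] += 1
    return hits[1] / M, hits[0] / M  # (est. f_T(.,1), est. f_T(.,0))

# Path 0 - 1 - 2 with node 0 labeled 0 and node 2 labeled 1
neighbors = {0: [1], 1: [0, 2], 2: [1]}
f1, f0 = sample_estimate(neighbors, {0: 0, 2: 1}, start=1, T=10, M=5000)
```

Each walk touches at most T nodes, so the cost is independent of the graph size — which is what makes the estimates fast enough for interactive reranking.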

23 Application scenario #2
My friend Ting Liu: former grad student, works on machine learning. Ting Liu from Harbin Institute of Technology: director of an IR lab, prolific author in NLP. DBLP treats both as one node, so the majority of a ranked list of papers for "Ting Liu" will be papers by the more prolific author. We cannot find the relevant results by reranking only the top 100; we must rank all nodes in the graph.

24 Branch and bound for scenario #2
We want to find the top k nodes in the harmonic measure, but we do not want to examine the entire graph (the labels are changing quickly over time). How about neighborhood expansion? It has been used successfully to compute Personalized PageRank (Chakrabarti, 2006), hitting/commute times (Sarkar & Moore, 2006), and local partitions in graphs (Spielman & Teng, 2004).

25 Branch & bound: first idea
Find a neighborhood S around the labeled nodes and compute the harmonic function only on that subset. However, this completely ignores the graph structure outside S, giving a poor approximation of the harmonic function and hence a poor ranking.

26 Branch & bound: a better idea
Gradually expand the neighborhood S, computing upper and lower bounds on the harmonic function of the nodes inside S. Expand until you are tired, then rank the nodes within S using the upper and lower bounds. This captures the influence of nodes outside S.
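One way such bounds can be realized (a sketch of an assumed mechanism, not necessarily the paper's exact recurrence): run the T-step averaging only on S, and treat any step that leaves S optimistically for the upper bound (as if it immediately hit a positive label) and pessimistically for the lower bound (as if it never will):

```python
def t_step_bounds(neighbors, labels, S, T):
    """Upper/lower bounds on f_T(i,1) for nodes i in S.
    Steps leaving S contribute `boundary` (1.0 -> upper, 0.0 -> lower)."""
    def run(boundary):
        f = {v: float(labels.get(v, 0)) for v in S}
        for _ in range(T):
            nxt = {}
            for v in S:
                if v in labels:
                    nxt[v] = float(labels[v])  # labeled nodes stay fixed
                else:
                    nbrs = neighbors[v]
                    nxt[v] = sum(f.get(u, boundary) for u in nbrs) / len(nbrs)
            f = nxt
        return f
    lo, hi = run(0.0), run(1.0)
    return {v: (lo[v], hi[v]) for v in S}

# Path 0 - 1 - 2; node 0 labeled negative; node 2 lies OUTSIDE S = {0, 1}
bounds = t_step_bounds({0: [1], 1: [0, 2], 2: [1]}, {0: 0}, S={0, 1}, T=3)
```

Expanding S shrinks the gap between the two runs, which is exactly the tightening shown on the grid slides that follow.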

27 Harmonic function on a grid (T = 3). (Figure: a grid graph with one node labeled y = 1 and one labeled y = 0.)

28 Harmonic function on a grid (T = 3). (Figure: [lower bound, upper bound] intervals, e.g. [0, .22] and [.33, .56], on the nodes in the current neighborhood.)

29 Harmonic function on a grid (T = 3): expanding the neighborhood gives tighter bounds, e.g. [.39, .5] and [.11, .33], with some nodes already tight: [.43, .43], [.17, .17].

30 Harmonic function on a grid (T = 3): eventually we have tight bounds for all nodes in the neighborhood, e.g. [0, 0], [.43, .43], [.17, .17], [.11, .11], [1/9, 1/9]. But we might miss good nodes outside the neighborhood.

31 Branch & bound: new and improved
Given a neighborhood S around the labeled nodes, compute upper and lower bounds for all nodes inside S, and a single upper bound ub(S) for all nodes outside S. Expand until ub(S) ≤ α; then all nodes outside S are guaranteed to have harmonic function value smaller than α. This is guaranteed to find all good nodes in the entire graph.

32 What if S is large?
Define S_α = {i : f_T(i,1) ≥ α} and let L_p be the set of positive nodes. Intuition: S_α is large if α is small, or if the positive nodes are relatively popular within S_α. For undirected graphs we prove an upper bound on the size of S_α that grows with the number of steps T and with the likelihood of hitting a positive label, and that has α in the denominator.

33 Talk Outline Ranking in graphs Reranking in graphs Harmonic functions for reranking Efficient algorithms Results

34 An example
A three-layered graph of papers, authors, and words, covering machine learning for disease outbreak detection, Bayesian network structure learning, link prediction, etc.

35 An example: the query "awm + disease + bayesian" on the papers/authors/words graph.

36 Results for "awm, bayesian, disease". (Figure: the initial ranklist, with relevant and irrelevant results marked.)

37 The user gives relevance feedback, marking nodes in the papers/authors/words graph as relevant or irrelevant.

38 Final classification: relevant results in the papers/authors/words graph.

39 After reranking. (Figure: the reranked list, with relevant and irrelevant results marked.)

40 Experiments
DBLP: 200K words, 900K papers, 500K authors.
- Two-layered graph [used by all authors]: papers and authors; 1.4M nodes, 2.2M edges.
- Three-layered graph [please look at the paper for more details]: also includes 15K words (frequency > 20 and < 5K); 1.4M nodes, 6M edges.

41 Entity disambiguation task
Pick 4 authors with the same surname (P. sarkar, Q. sarkar, R. sarkar, S. sarkar) and merge them into a single node "sarkar". Use a ranking algorithm (e.g., hitting time) to compute the nearest neighbors of the merged node. Say we want to find "P. sarkar": label the top L papers in the ranklist as relevant (by P. sarkar) or irrelevant (by the others). Use the rest of the papers in the ranklist as the test set and compute the AUC score of the harmonic measure against the ground truth.
Example ranklist by hitting time: 1. Paper-564: S. sarkar; 2. Paper-22: Q. sarkar; 3. Paper-61: P. sarkar; 4. Paper-1001: R. sarkar; 5. Paper-121: R. sarkar; 6. Paper-190: S. sarkar; 7. Paper-88: P. sarkar; 8. Paper-1019: Q. sarkar.
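The AUC score used throughout is the probability that a randomly chosen relevant item is ranked above a randomly chosen irrelevant one (ties counting half). A small self-contained sketch (the example scores are illustrative assumptions):

```python
def auc(pos_scores, neg_scores):
    """AUC: fraction of (relevant, irrelevant) pairs ranked correctly,
    with ties counted as half-correct."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Perfect separation of relevant from irrelevant scores gives 1.0
score = auc([0.9, 0.8, 0.7], [0.3, 0.2])
```

An AUC of 0.5 corresponds to a random ranking, 1.0 to a perfect one, which is why "higher is better" on the result figures.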

42 Effect of T: T = 10 is good enough. (Figure: AUC score vs. number of labels.)

43 Comparison with Personalized PageRank (PPV) from the positive nodes. (Figure: AUC score vs. number of labels, for conditional harmonic probability and PPV from the positive labels.)

44 Timing results for retrieving the top 10 results in the harmonic measure
- Two-layered graph: branch & bound, 1.6 seconds; sampling from 1000 nodes, 90 seconds.
- Three-layered graph: see the paper for results.

45 Conclusion
- Proposed an on-the-fly reranking algorithm: not an offline process over a static set of labels, and it uses both positive and negative labels.
- Introduced T-step harmonic functions, which take care of a skewed distribution of labels.
- Highly efficient and scalable algorithms.
- On quantitative entity-disambiguation tasks from the DBLP corpus we show the effectiveness of using negative labels, and that a small T does not hurt. Please see the paper for more experiments!

46 Thanks!

47 Reranking challenges
Must be performed on-the-fly, not as an offline process over prior user feedback. Should use both positive and negative feedback, and also deal with imbalanced feedback (e.g., many negative, few positive).

48 Scenario #2: sampling
Sample M paths from the source. A path ends when it reaches length T or when it hits a labeled node. If M_p of these paths hit a positive label and M_n hit a negative label, then the estimates are f_T(source,1) ≈ M_p/M and f_T(source,0) ≈ M_n/M. We can prove that with enough samples these estimates are close to the true values with high probability.

49 Comparison with hitting time from the positive nodes, on the two-layered graph. (Figure: AUC score vs. number of labels, for conditional harmonic probability and hitting time from the positive labels.)

50 Timing results
The average degree increases by a factor of 3, and so does the average time for sampling. The expansion property (number of nodes within 3 hops) increases by a factor of 80, while the time for branch & bound increases by a factor of 20.