SimRank: A Measure of Structural-Context Similarity (Glen Jeh & Jennifer Widom, KDD 2002)

Similar presentations
Lecture 7. Network Flows We consider a network with directed edges. Every edge has a capacity. If there is an edge from i to j, there is an edge from.
Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
CSE 5243 (AU 14) Graph Basics and a Gentle Introduction to PageRank 1.
Link Analysis: PageRank
Lectures on Network Flows
Entropy Rates of a Stochastic Process
More on Rankings. Query-independent LAR Have an a-priori ordering of the web pages Q: Set of pages that contain the keywords in the query q Present the.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 PageSim: A Link-based Measure of Web Page Similarity Research Group Presentation Allen Z. Lin, 8 Mar 2006.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
ON LINK-BASED SIMILARITY JOIN A joint work with: Liwen Sun, Xiang Li, David Cheung (University of Hong Kong) Jiawei Han (University of Illinois Urbana.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
Scaling Personalized Web Search Glen Jeh, Jennifer Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006.
DAST 2005 Tirgul 12 (and more) sample questions. DAST 2005 Q.We’ve seen that solving the shortest paths problem requires O(VE) time using the Belman-Ford.
1 PageSim: A Link-based Similarity Measure for the World Wide Web Zhenjiang Lin, Irwin King, and Michael, R., Lyu Computer Science & Engineering, The Chinese.
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
1 Applications of Relative Importance  Why is relative importance interesting? Web Social Networks Citation Graphs Biological Data  Graphs become too.
Random Walks and Semi-Supervised Learning Longin Jan Latecki Based on : Xiaojin Zhu. Semi-Supervised Learning with Graphs. PhD thesis. CMU-LTI ,
Random Walk with Restart (RWR) for Image Segmentation
Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,
Graph Algorithms. Definitions and Representation An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite.
Scaling Personalized Web Search Authors: Glen Jeh, Jennifer Widom Stanford University Written in: 2003 Cited by: 923 articles Presented by Sugandha Agrawal.
P-Rank: A Comprehensive Structural Similarity Measure over Information Networks CIKM’ 09 November 3 rd, 2009, Hong Kong Peixiang Zhao, Jiawei Han, Yizhou.
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
Markov Chains and Random Walks. Def: A stochastic process X={X(t),t ∈ T} is a collection of random variables. If T is a countable set, say T={0,1,2, …
Data Structures & Algorithms Graphs
Trust Management for the Semantic Web Matthew Richardson1†, Rakesh Agrawal2, Pedro Domingos By Tyrone Cadenhead.
SimRank : A Measure of Structural-Context Similarity
1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.
Slides are modified from Lada Adamic
Bipartite Matching. Unweighted Bipartite Matching.
1 Authors: Glen Jeh, Jennifer Widom (Stanford University) KDD, 2002 Presented by: Yuchen Bian SimRank: a measure of structural-context similarity.
Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.
1 Friends and Neighbors on the Web Presentation for Web Information Retrieval Bruno Lepri.
Algorithms For Solving History Sensitive Cascade in Diffusion Networks Research Proposal Georgi Smilyanov, Maksim Tsikhanovich Advisor Dr Yu Zhang Trinity.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Progress Report ekker. Problem Definition In cases such as object recognition, we can not include all possible objects for training. So transfer learning.
Section 9.3. Section Summary Representing Relations using Matrices Representing Relations using Digraphs.
SimRank: A Measure of Structural-Context Similarity Glen Jeh and Jennifer Widom Stanford University ACM SIGKDD 2002 January 19, 2011 Taikyoung Kim SNU.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Neighborhood - based Tag Prediction
Markov Chains and Random Walks
Main algorithm with recursion: We’ll have a function DFS that initializes, and then calls DFS-Visit, which is a recursive function and does the depth first.
CIKM’ 09 November 3rd, 2009, Hong Kong
Enumerating Distances Using Spanners of Bounded Degree
Lecture 22 SVD, Eigenvector, and Web Search
Instructor: Shengyu Zhang
SAT-Based Area Recovery in Technology Mapping
Probably Approximately
Department of Computer Science University of York
Graph Clustering based on Random Walk
Zhenjiang Lin, Michael R. Lyu and Irwin King
Jinhong Jung, Woojung Jin, Lee Sael, U Kang, ICDM ‘16
Asymmetric Transitivity Preserving Graph Embedding
Algorithms (2IL15) – Lecture 7
Presentation transcript:

Glen Jeh & Jennifer Widom KDD 2002

 Many applications require a measure of “similarity” between objects.  Web search  Shopping Recommendations  Search for “Related Works” among scientific papers  But “similarity” may be domain-dependent.  Can we define a generic model for similarity?

 What do all these applications have in common?  A data set of objects linked by a set of relations.  Then, a generic concept of similarity is structural-context similarity:  “Two objects are similar if they relate to similar objects.”  Recall automorphic equivalence:  “Two objects are equivalent if they relate to equivalent objects.”

 Given a graph G = (V, E), for each pair of vertices a,b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.

 Directed Graph G = (V,E)  V = set of objects  E = set of unweighted edges  Edge (u,v) exists if there is a relation u → v  I(v) = set of in-neighbors of vertex v  O(v) = set of out-neighbors of vertex v
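To make the model concrete, here is a minimal Python sketch (not from the paper; the `DiGraph` name and layout are illustrative) that stores each edge both ways so I(v) and O(v) are constant-time lookups:

```python
from collections import defaultdict

class DiGraph:
    """Directed, unweighted graph with O(1) access to I(v) and O(v)."""
    def __init__(self, edges):
        self.I = defaultdict(set)   # I[v] = in-neighbors of v
        self.O = defaultdict(set)   # O[v] = out-neighbors of v
        self.V = set()
        for u, v in edges:          # edge u -> v means "u relates to v"
            self.O[u].add(v)
            self.I[v].add(u)
            self.V.update((u, v))

# Example graph from the next slide: Univ references both professors,
# and each professor references a student.
G = DiGraph([("Univ", "ProfA"), ("Univ", "ProfB"),
             ("ProfA", "StudentA"), ("ProfB", "StudentB")])
print(sorted(G.I["StudentA"]))      # ['ProfA']
```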

 Recursive Model  “Two objects are similar if they are referenced by similar objects”  That is, a ~ b if there exist c → a and d → b with c ~ d  An object is equivalent to itself (score = 1)  Example 1. ProfA ~ ProfB because both are referenced by Univ 2. StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}

 s(a,b) = similarity between a and b = average similarity between the in-neighbors of a and the in-neighbors of b  s(a,b) is in the range [0, 1]  If a = b, then s(a,b) = 1  If a ≠ b, then s(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{u \in I(a)} \sum_{v \in I(b)} s(u,v), where C is a constant, 0 < C < 1  If I(a) = ∅ or I(b) = ∅, then s(a,b) = 0

 x is identical to itself: s(x,x) = 1  Since we have x → a and x → b, should s(a,b) = 1 also?  If the graph represented all the information about x, a, and b, then s(a,b) would ideally be 1.  But in reality the graph does not describe everything about them, so we expect s(a,b) < 1.  Therefore, the constant C expresses our limited confidence, or decay with distance: s(a,b) = C ∙ average similarity of (I(a), I(b)). [Figure: a single node x with edges x → a and x → b.]

 Given graph G, define G² = (V², E²) where  V² = V × V. Each vertex in V² is a pair of vertices in V.  E²: (a,b) → (c,d) in G² iff a → c and b → d in G  Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex.
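A sketch of constructing G² from the `DiGraph` above (an assumption carried over from the earlier sketch); pairs are canonicalized in sorted order so (a,b) and (b,a) collapse into one vertex:

```python
from itertools import combinations_with_replacement

def pair_graph(G):
    """Build G^2: vertices are unordered pairs; (a,b) -> (c,d) iff a->c and b->d."""
    V2 = {tuple(sorted(p)) for p in combinations_with_replacement(sorted(G.V), 2)}
    E2 = set()
    for a, b in V2:
        for c in G.O[a]:            # iterating all of O[a] x O[b] also covers
            for d in G.O[b]:        # the swapped pairing, since targets are sorted
                E2.add(((a, b), tuple(sorted((c, d)))))
    return V2, E2

V2, E2 = pair_graph(G)
print(len(V2), len(E2))
```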

 SimRank score for a vertex (a,b) in G² = similarity between a and b in G.  The source of similarity is the self-vertices, like (Univ, Univ).  Then similarity propagates along pair-paths in G², away from the sources.  Note that values decrease away from (Univ, Univ)

 Bipartite: 2 types of objects  Example: Buyers and Items

 Two types of similarity:  Two buyers are similar if they buy similar items  Out-neighbors of buyers are relevant: s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i \in O(A)} \sum_{j \in O(B)} s(i,j)  Two items are similar if they are bought by similar buyers  In-neighbors of items are relevant: s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{u \in I(c)} \sum_{v \in I(d)} s(u,v)  In general, we can use I(·) and/or O(·) for any graph
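A hedged sketch of one bipartite sweep, assuming the `DiGraph` sketch above and a score table `s` keyed by sorted pairs (initialized to 1 on identical pairs and 0 elsewhere):

```python
def bipartite_step(G, buyers, items, s, C1=0.8, C2=0.8):
    """One sweep of the two bipartite SimRank equations (illustrative sketch)."""
    key = lambda x, y: tuple(sorted((x, y)))
    s_new = dict(s)
    for A in buyers:                       # buyer similarity via out-neighbors
        for B in buyers:
            if A != B and G.O[A] and G.O[B]:
                total = sum(s[key(i, j)] for i in G.O[A] for j in G.O[B])
                s_new[key(A, B)] = C1 * total / (len(G.O[A]) * len(G.O[B]))
    for c in items:                        # item similarity via in-neighbors
        for d in items:
            if c != d and G.I[c] and G.I[d]:
                total = sum(s[key(u, v)] for u in G.I[c] for v in G.I[d])
                s_new[key(c, d)] = C2 * total / (len(G.I[c]) * len(G.I[d]))
    return s_new
```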

 Motivation: Two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1}  SimRank compares each course of A with each course of B  But intuitively we just want the best matching pairs: s(Eng1_A, Eng1_B), s(Math1_A, Math1_B), etc.  Solution: Two steps  Max: pair each neighbor of A with only its most similar neighbor of B, giving s_A(A,B); do the same in the other direction to get s_B(A,B)  Min: the final s(A,B) is the smaller of s_A(A,B) and s_B(A,B) [weakest link]
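One plausible reading of the min-max rule in code (a sketch, assuming the same sorted-pair score table `s`; `NA` and `NB` are the relevant neighbor sets of A and B):

```python
def minimax_sim(s, NA, NB, C=0.8):
    """Min-max SimRank for one pair (sketch): best-match each neighbor in
    each direction, then keep the weaker of the two directed scores."""
    key = lambda x, y: tuple(sorted((x, y)))
    def directed(X, Y):
        if not X or not Y:
            return 0.0
        # each neighbor of X is matched only with its best counterpart in Y
        return C * sum(max(s[key(i, j)] for j in Y) for i in X) / len(X)
    return min(directed(NA, NB), directed(NB, NA))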

 R_k(a,b) = estimate of SimRank after k iterations  Initialization: R_0(a,b) = 1 if a = b, and 0 otherwise  Iteration: R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{u \in I(a)} \sum_{v \in I(b)} R_k(u,v)  R_k(a,b) is the similarity that has flowed a distance k away from the sources; R_k values are non-decreasing as k increases  We can prove that R_k(a,b) converges to s(a,b)
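A direct, naive implementation of this iteration (a sketch using the `DiGraph` above; the example graph is the Univ/Prof/Student one from earlier):

```python
def simrank(G, C=0.8, iters=10):
    """Naive fixed-point iteration R_k -> s(a,b) (sketch)."""
    nodes = sorted(G.V)
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        R_next = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    R_next[(a, b)] = 1.0
                elif G.I[a] and G.I[b]:
                    total = sum(R[(u, v)] for u in G.I[a] for v in G.I[b])
                    R_next[(a, b)] = C * total / (len(G.I[a]) * len(G.I[b]))
                else:
                    R_next[(a, b)] = 0.0     # empty in-neighborhood: score 0
        R = R_next
    return R

scores = simrank(G)
print(round(scores[("ProfA", "ProfB")], 2))        # 0.8: both referenced by Univ
print(round(scores[("StudentA", "StudentB")], 2))  # 0.64: referenced by similar profs
```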

 Space complexity: O(n²) to store R_k(a,b)  Time complexity: O(K n² d²), where d² is the average of |I(a)||I(b)| over all vertex pairs (a,b) and K is the number of iterations  To improve performance, we can prune G²:  Idea: vertices that are far apart should have very low similarity, so we approximate it as 0.  Select a radius r. If a vertex-pair (a,b) cannot meet in fewer than r steps, remove it from the graph G².  Space complexity: O(n d_r)  Time complexity: O(K n d_r d²), where d_r = avg. number of neighbors within radius r.
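A sketch of the radius-r pruning: BFS forward from the self-pairs (x,x) in G², since a pair (a,b) can meet within r steps exactly when it is reachable from some (x,x) in at most r pair-steps. Illustrative only:

```python
from collections import deque

def pairs_within_radius(G, r):
    """Pairs of G^2 kept after pruning: reachable from some self-pair (x,x)
    in at most r steps; all other pairs are approximated as similarity 0."""
    seen = {(x, x) for x in G.V}
    frontier = deque((p, 0) for p in seen)
    while frontier:
        (a, b), k = frontier.popleft()
        if k == r:
            continue
        for c in G.O[a]:               # follow G^2 edges (a,b) -> (c,d)
            for d in G.O[b]:
                p = tuple(sorted((c, d)))
                if p not in seen:
                    seen.add(p)
                    frontier.append((p, k + 1))
    return seen

print(len(pairs_within_radius(G, 2)))
```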

 SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards  Background: basic forward random walk  Motion is in discrete steps, along edges of the graph.  At each time step, there is an equal probability of moving from the current vertex to each of its out-neighbors.  Given adjacency matrix A, the probability of walking from x to y is p_{xy} = a_{xy}/|O(x)|.  Random walk as a Markov process  The initial location is described by the probability distribution vector π^{(0)}  Probability of being at y at time 1: π^{(1)}_y = \sum_x π^{(0)}_x \, p_{xy}

 Given adjacency matrix A:  The forward transition matrix P has entries p_{xy} = a_{xy}/|O(x)|; the backward transition matrix Q has entries q_{xy} = a_{yx}/|I(x)| (one backward step from x lands on a uniformly random in-neighbor).
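A small numpy sketch of both matrices for a toy adjacency matrix (assumed data; rows with no neighbors are left as zero rows):

```python
import numpy as np

# Toy adjacency matrix (assumed data): a_xy = 1 iff there is an edge x -> y.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

out_deg = A.sum(axis=1, keepdims=True)        # |O(x)| for each row x
in_deg = A.sum(axis=0, keepdims=True).T       # |I(x)| for each row x of A.T

# Forward: p_xy = a_xy / |O(x)|; backward: q_xy = a_yx / |I(x)|.
P_fwd = np.divide(A, out_deg, out=np.zeros_like(A), where=out_deg > 0)
P_bwd = np.divide(A.T, in_deg, out=np.zeros_like(A), where=in_deg > 0)
print(P_bwd)
```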

 Probability of walking backwards from a to x in one step: q_{ax} = a_{xa}/|I(a)|  Two walkers meet at x if they start at a and b and each takes a backward step along the edges x → a and x → b, respectively: s_x(a,b) = P(meeting at x) = q_{ax} \, q_{bx}, and s(a,b) = P(meeting) = \sum_x q_{ax} \, q_{bx}  If they start together, they have already met, so s^{(0)}_{xy} = 1 if x = y, and 0 otherwise [identity matrix]  Then the iteration in matrix form is S^{(k+1)} = C \, Q \, S^{(k)} Q^T, with the diagonal reset to 1
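Continuing the numpy sketch above, the iteration in matrix form (our reconstruction of the update, not verbatim from the slides):

```python
# Each step propagates similarity one backward-step from both endpoints,
# scaled by C; the diagonal is pinned back to 1 because every node is
# fully similar to itself.
C = 0.8
S = np.eye(A.shape[0])          # S^(0): walkers that start together have met
for _ in range(20):
    S = C * (P_bwd @ S @ P_bwd.T)
    np.fill_diagonal(S, 1.0)
print(np.round(S, 3))
```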

 Two data sets  ResearchIndex: a corpus of scientific research papers  688,898 cross-references among 278,628 papers  Student transcripts  1,030 undergraduate students in the School of Engineering at Stanford University  Each transcript lists all courses the student has taken so far (average: 40 courses/student)

 Problem: it is difficult to know the “correct” similarity between items.  Solution: define a rough domain-specific metric σ(p,q):  For scientific papers, we have two versions: σ_C(p,q) = fraction of q’s citations also cited by p; σ_T(p,q) = fraction of words in q’s title also in p’s title  For university courses: σ_D(p,q) = 1 if p and q are in the same department, else 0
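For instance, σ_C translates directly into a set-overlap computation (sketch; `cites` is a hypothetical map from a paper to the set of papers it cites):

```python
def sigma_c(p, q, cites):
    """Fraction of q's citations that p also cites (sketch)."""
    return len(cites[p] & cites[q]) / len(cites[q]) if cites[q] else 0.0

cites = {"p": {"a", "b", "c"}, "q": {"b", "c", "d", "e"}}
print(sigma_c("p", "q", cites))  # 2 of q's 4 citations are shared -> 0.5
```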

 Run the similarity algorithms:  SimRank (naive, pruned, minmax)  Co-Citation  For each object p and algorithm A, form the set top_{A,N}(p) of the N objects most similar to p.  For each q ∈ top_{A,N}(p), compute σ(p,q).  Return the average σ_{A,N}(p) over all q.
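The evaluation loop is then only a few lines (sketch; `scores` is any similarity table keyed by sorted pairs, and `sigma` is any two-argument domain metric, e.g. a closure over `sigma_c`):

```python
def avg_sigma(p, objects, scores, sigma, N=10):
    """Average domain metric over the N objects most similar to p (sketch)."""
    key = lambda x, y: tuple(sorted((x, y)))
    top = sorted((q for q in objects if q != p),
                 key=lambda q: scores[key(p, q)], reverse=True)[:N]
    return sum(sigma(p, q) for q in top) / len(top) if top else 0.0
```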

 Setup  Used bipartite SimRank, considering only in-neighbors (validation uses out-neighbors)  N ∈ {5, 10, …, 45, 50}  Results  Not very sensitive to the decay factors C₁ and C₂  Pruning the search radius had little effect on the rank order of scores.

 Setup  Bipartite domain  N ∈ {5, 10}  Results  The min-max version of SimRank performed the best  Not very sensitive to the decay factors C₁ and C₂

Co-citation scores are very poor (0.161 for N = 5 and 0.147 for N = 10), so they are not shown in the graph.

 Defined a recursive model of structural similarity between objects in a network  Mathematically formulated SimRank based on the recursive concept  Presented a convergent algorithm to compute SimRank  Described a random-walk interpretation of SimRank equations and scores  Experimentally validated SimRank over two real data sets

 O(n²) is large; scalability needs to be improved.  s(a,b) only includes contributions from paths on which a and b are the same distance from some x. What if the distances are offset (i.e., the total path length is odd)?  As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)!  Addressed partially by the minimax method