1 SimRank: a measure of structural-context similarity
Authors: Glen Jeh, Jennifer Widom (Stanford University), KDD, 2002
Presented by: Yuchen Bian
2 Content
1. Motivation
----Similarity measure
2. SimRank and other Versions
----Basic definition
----Variant versions
----Compute SimRank
3. SimRank and Random Walk
----Expected-f meeting distance
4. Experiments
5. Conclusion
4 1. Motivation
Real-world problems:
a. Ranking problem: friendship in social networks, keyword search on the WWW, scientific paper citations
b. Link prediction problem: spam prediction in the WWW network, recommender systems in markets
Common problem: given a node (or nodes) in a graph, which other nodes are (most) similar to it?
One node: single-source; all nodes: all-pairs
5 1. Motivation
Similarity:
Input: topological graph (nodes, edges)
Output: similarity of other nodes to the query nodes
a. Shortest path
b. Common neighbors
Bibliometrics of scientific papers: co-citation, bibliographic coupling
Web pages: hub and authority
[Figure: two diagrams of nodes p and q with neighbors a, b, ..., n]
6 1. Motivation
Similarity:
Input: nodes, topological graph
Output: similarity of other nodes to the query nodes
Observation: two objects are similar if they are related to similar objects.
Intuition: similarity can transfer from node pairs to other node pairs.
[Figure: nodes N_1 and N_2; labels: nodes, edges, node pairs, order, “simplified”]
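8 2. SimRank and other Versions
SimRank: Basic Version
s(a,b) = \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s(I_i(a), I_j(b))
Base case: a=b, s(a,b)=1
Special case: |I(a)|=0 or |I(b)|=0, s(a,b)=0
“c”: confidence factor
I(a): the set of in-neighbors of a; I_i(a) is the i-th in-neighbor
Information flow: how much similarity (information) can flow to a and b from the (similarity) sources
Jeh G, Widom J. SimRank: a measure of structural-context similarity. SIGKDD, 2002.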
9 2. SimRank and other Versions
SimRank: Basic Version
Base case: a=b, s(a,b)=1
Special case: |I(a)|=0 or |I(b)|=0, s(a,b)=0
“c”: confidence factor
In G^2: 15 nodes, 21 edges
Node pairs shown in G^2: {ProfA,ProfA}, {StudentA,StudentA}, {StudentB,StudentB}, {ProfB,ProfB}, {ProfA,StudentA}, {Univ,ProfA}, {StudentA,Univ}
s{ProfA,StudentA} = c·s{Univ,ProfA} = c^2·s{StudentA,Univ} = c^3·s{ProfA,StudentA}
Since c^3 is not 1, the only solution is s{ProfA,StudentA} = s{Univ,ProfA} = s{StudentA,Univ} = 0
10 2. SimRank and other Versions
SimRank: Bipartite Version
Recommender system
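11 2. SimRank and other Versions
SimRank: Bipartite Version in a Homogeneous Domain
Web pages: hub and authority
“Points to” similarity score: s_1(a,b) = \frac{C_1}{|O(a)||O(b)|} \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s_2(O_i(a), O_j(b))
“Pointed to” similarity score: s_2(a,b) = \frac{C_2}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s_1(I_i(a), I_j(b))
[Figure: example graph on nodes a, b, c, d, e, f, g]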
12 2. SimRank and other Versions
SimRank: Minimax Version
Students (A, B) and courses (c, d)
A and B are two students in the same major.
Some of the courses they select are the same, e.g. curricular requirements.
But they must also select some elective courses: A selects c, B selects d.
Problem: what is the similarity of two (different-major) students? (Inverse: what is the similarity of two (elective) courses?)
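The name “minimax” is easier to follow with the scoring rule written out. The block below is a sketch that assumes the form used in the cited Jeh and Widom paper: O(A) denotes the courses taken by student A, and C_1 is a decay constant playing the role of c. Both symbols are assumptions here, since the slide itself does not show the formula.

```latex
% For each course of A, take the best-matching course of B (max), then average.
s_A(A,B) = \frac{C_1}{|O(A)|} \sum_{i=1}^{|O(A)|} \max_{1 \le j \le |O(B)|} s\bigl(O_i(A), O_j(B)\bigr)

% Symmetric score from B's side.
s_B(A,B) = \frac{C_1}{|O(B)|} \sum_{j=1}^{|O(B)|} \max_{1 \le i \le |O(A)|} s\bigl(O_i(A), O_j(B)\bigr)

% The final score is the weaker of the two directions (min).
s(A,B) = \min\bigl(s_A(A,B),\; s_B(A,B)\bigr)
```

Under this rule an elective course of A is compared only with its closest counterpart among B's courses, so unrelated course pairs do not dilute the score.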
14 2. SimRank and other Versions
Compute SimRank
Naive method: iterative fixed-point method (iterate until convergence)
R_0(a,b) = 1 if a=b, 0 otherwise
R_{k+1}(a,b) = \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b)) for a≠b
R_{k+1}(a,a) = 1; R_{k+1}(a,b) = 0 if |I(a)|=0 or |I(b)|=0
R_k converges to a unique limit s (0≤s≤1):
1. Existence: lim(R_k) exists (R_k converges)
2. Correctness: lim(R_k) = s (0 ≤ lim(R_k) ≤ 1)
3. Uniqueness: lim(R_k) is unique
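The naive iterative method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `simrank`, its parameters, and the default c = 0.8 are hypothetical, and the edge list `EDGES` is an assumed reconstruction of the university example (only the Univ, ProfA, StudentA cycle can be inferred from the slides).

```python
from itertools import product


def simrank(edges, c=0.8, iterations=50):
    """Naive all-pairs SimRank by fixed-point iteration.

    edges: iterable of (source, target) pairs of a directed graph.
    Returns a dict mapping (a, b) to R_k(a, b) after `iterations` rounds.
    """
    nodes = sorted({n for edge in edges for n in edge})
    # I(v): list of in-neighbors of v
    in_nbrs = {v: [] for v in nodes}
    for src, dst in edges:
        in_nbrs[dst].append(src)

    # R_0(a, b) = 1 if a == b, else 0
    scores = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}

    for _ in range(iterations):
        new_scores = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_scores[(a, b)] = 1.0  # base case
            elif not in_nbrs[a] or not in_nbrs[b]:
                new_scores[(a, b)] = 0.0  # special case: no in-neighbors
            else:
                # Sum the previous scores over all in-neighbor pairs
                total = sum(scores[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                new_scores[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        scores = new_scores
    return scores


# Assumed edge list for the university example (an assumption, see above).
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("StudentA", "Univ"),
    ("ProfB", "StudentB"), ("StudentB", "ProfB"),
]

if __name__ == "__main__":
    s = simrank(EDGES)
    print(round(s[("ProfA", "ProfB")], 3))
    print(s[("ProfA", "StudentA")])
```

Each round touches every node pair and every pair of their in-neighbors, which is where the quadratic space and the per-iteration cost of the naive method come from.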
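16 2. SimRank and other Versions
Complexity of the naive method (iterative fixed-point method)
Space: O(n^2)
Runtime: O(K n^2 d_2), where K is the number of iterations and d_2 is the average of |I(a)||I(b)| over all node pairs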
17 2. SimRank and other Versions
Pruning Strategy:
“When n is significantly large, it is very likely that the neighborhood (say, nodes within a radius of 2 or 3) of a typical node will be a very small percentage (< 1%) of the entire domain.”
“If we consider only node-pairs within a radius of r from each other in the underlying undirected graph (other criteria are possible), and there are on average d_r such neighbors for a node, then there will be n d_r node-pairs.”
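The candidate set used by this pruning strategy can be sketched as a breadth-first search that ignores edge direction and stops at radius r. The function name `pairs_within_radius` is hypothetical, and this shows only the pair selection, not a full pruned SimRank computation.

```python
from collections import deque


def pairs_within_radius(edges, r):
    """Return the unordered node pairs at distance <= r in the underlying
    undirected graph (edge direction is ignored)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    pairs = set()
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == r:
                continue  # do not expand beyond radius r
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        # Every node reached (other than start) is within radius r of start
        for other in dist:
            if other != start:
                pairs.add(frozenset((start, other)))
    return pairs
```

Scores would then be stored and updated only for the returned pairs, with all other pairs treated as having similarity 0.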
18 “limited-information” problem
Problem: document corpora contain many “unpopular” documents, i.e. documents with very few in-citations. Although the scarcity of contextual information makes them difficult to analyze, these documents are often the most important, since they tend to be harder for humans to find. This is especially true for new documents, which are likely unpopular because it takes time for others to notice and cite them, yet new documents are often the ones we are most interested in.
Analysis: A is a new paper cited only by B, so A_1 to A_m all seem equally similar to A. But when the earlier citations are considered, intuitively A_m and A receive “similarity” from B and B', which are similar to each other, so the similarity of A_m and A is larger than that of A_1 and A.
19 “limited-information” problem
Complementary problem: we are interested in a general document C and ask whether A, which has limited information, should be included in a list of the documents most similar to C.
On the one hand, A has only one in-citation, which might be an “outlier” citation; it would be safer to consider only documents for which we have more information.
On the other hand, we do not want to eliminate unpopular documents from consideration, nor to favor popular documents for every query.
Strategy: eliminate the |I(b)| factor and re-weight the final results, where the constant 0 < P < 1 is a parameter adjustable by the end user.
21 4. Experiments
Scientific research papers (homogeneous graph)
Dataset:
Nodes: 278,628 papers
Edges: 688,898 cross-references
Two rough similarity baselines:
Citations: fraction of q's citations also cited by p
Titles: fraction of words in q's title also in p's title
Algorithms are evaluated by their average improvement (difference) over the baselines.
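The two rough baselines are simple set-overlap fractions. The sketch below is one possible reading: the function names are hypothetical, and lowercasing, whitespace tokenization, counting distinct words, and returning 0 for an empty q are assumptions that the slide does not specify.

```python
def citation_overlap(q_citations, p_citations):
    """Fraction of q's citations that are also cited by p."""
    q, p = set(q_citations), set(p_citations)
    return len(q & p) / len(q) if q else 0.0


def title_overlap(q_title, p_title):
    """Fraction of the distinct words in q's title that also occur in p's title."""
    q_words = set(q_title.lower().split())
    p_words = set(p_title.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0
```

Both measures are asymmetric in p and q: the score of p with respect to q is normalized by q's citations or title words only.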
22 4. Experiments
Scientific research papers (homogeneous graph)
For 13,481 objects p, run SimRank and co-citation and select the top N nodes.
Results:
23 4. Experiments
Students and courses (bipartite graph)
Dataset:
Nodes: 1030 undergraduate students, averaging 40 courses per student
Edges: 1030 transcripts
Rough similarity baseline: if two courses p and q are from the same department, rough similarity = 1; otherwise 0
Algorithms are evaluated by their average improvement (difference) over the baselines.
24 4. Experiments
Students and courses (bipartite graph), 3193 trials
26 5. Conclusion
Evaluation of SimRank:
Advantages:
----The intuition is easy to understand
----Solves the all-pairs similarity problem and converges in a few iterations
----Ties in with the random-walk view
----Considers the entire topological structure, not only common neighbors, which helps especially with the “limited-information” problem
----Can be combined with other similarity measures
Disadvantages:
----Space: O(n^2)
----Runtime: O(K n^2 d_2)
----The pruning strategy cuts nodes
----Used on its own, SimRank sometimes contradicts direct intuition about similarity
27 Yuchen Bian Thank you! Q & A
29 2. SimRank and other Versions
Compute SimRank
Naive method: iterative fixed-point method
R_k converges to a unique limit s (0≤s≤1):
1. lim(R_k) exists (R_k converges)
2. lim(R_k) is unique
3. lim(R_k) = s
4. 0 ≤ lim(R_k) ≤ 1
Proof of 4: by induction on k. R_0 takes values in [0,1], and for a≠b, R_{k+1}(a,b) is c times an average of values of R_k; since 0 < c < 1, R_{k+1} stays in [0,1].
30 2. SimRank and other Versions
Compute SimRank
Naive method: iterative fixed-point method
R_k converges to a unique limit s (0≤s≤1):
1. lim(R_k) exists (R_k converges)
Proof of 1: use the monotonicity of R_k and the Completeness Axiom. R_k is monotonically nondecreasing in k, with R_0 as a lower bound, and it is bounded above by 1, so R_k converges to a limit R.
31 2. SimRank and other Versions
Compute SimRank
Naive method: iterative fixed-point method
R_k converges to a unique limit s (0≤s≤1):
1. lim(R_k) exists (R_k converges)
2. lim(R_k) is unique
3. lim(R_k) = s
4. 0 ≤ lim(R_k) ≤ 1
Proof of 3: take the limit k → ∞ on both sides of the iteration for R_{k+1}; the limit then satisfies the SimRank equations, so it is a solution s.
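32 2. SimRank and other Versions
Compute SimRank
Naive method: iterative fixed-point method
R_k converges to a unique limit s (0≤s≤1):
2. lim(R_k) is unique
Proof: assume there are two solutions, s_1(a,b) and s_2(a,b).
For any pair a, b in V, let \delta(a,b) = s_1(a,b) - s_2(a,b), and let M = \max_{(a,b)} |\delta(a,b)|.
Special cases: if a=b, then s_1(a,b) = s_2(a,b) = 1 and \delta(a,b) = 0. If a or b has no in-neighbors, then s_1(a,b) = s_2(a,b) = 0 and \delta(a,b) = 0.
Otherwise: |\delta(a,b)| = \left| \frac{c}{|I(a)||I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} \delta(I_i(a), I_j(b)) \right| \le \frac{c}{|I(a)||I(b)|} \cdot |I(a)||I(b)| \cdot M = cM
Taking the maximum over all pairs gives M ≤ cM. Since 0 < c < 1, M = 0 and the solution is unique.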
34 3. SimRank and Random Walk
The SimRank score s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph.
Random walk with restart: high iteration count, single-source
O(n^3) to O(n^2) (top-k), all-pairs
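35 3. SimRank and Random Walk
In a strongly connected graph:
The expected distance d(u,v) is the expected number of steps of a random walk that starts at u and ends at v without touching v before the end (the expected number of steps to first reach v).
Expected distance: d(u,v) = \sum_{t: u \leadsto v} P[t] \, l(t), where t is a tour from u to v that touches v only at its end, l(t) is its length, and P[t] is the probability of traveling t
Recursive version: d(v,v) = 0 and, for u≠v, d(u,v) = 1 + \frac{1}{|O(u)|} \sum_{i=1}^{|O(u)|} d(O_i(u), v)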
36 3. SimRank and Random Walk
In a derived graph G^2:
G^2 = (V^2, E^2); each node of V^2 = V × V represents a pair (a,b) of nodes in G.
An edge from (a,b) to (c,d) exists in E^2 iff the edges (a,c) and (b,d) exist in G.
The expected meeting distance (EMD) is the expected distance in G^2 from (a,b) to any singleton node (x,x), i.e. the two walks meet at the same node:
m(a,b) = \sum_{t: (a,b) \leadsto (x,x)} P[t] \, l(t), where t is a tour from (a,b) to (x,x)
Be careful! G^2 is not strongly connected, so the distance may be infinite.
Intuitively, there must be some relationship between EMD and the SimRank score if we think in terms of similarity (information) flow.
[Figure: example graph on nodes a, b, c, d, f]
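The construction of G^2 is mechanical enough to sketch directly. The function name `pair_graph` is hypothetical, and the sketch uses ordered pairs, following the V × V definition; merging (a,b) with (b,a) into unordered pairs would be a further step not shown here.

```python
from itertools import product


def pair_graph(edges):
    """Build G^2 = (V^2, E^2) from the directed edge list of G.

    V^2 holds every ordered pair (a, b) of nodes of G; E^2 has an edge from
    (a, b) to (c, d) exactly when G has both edges (a, c) and (b, d).
    """
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge}
    nodes2 = set(product(nodes, nodes))
    # Pair every edge (a, c) with every edge (b, d), including itself
    edges2 = {((a, b), (c, d)) for (a, c), (b, d) in product(edge_set, edge_set)}
    return nodes2, edges2
```

With ordered pairs, G^2 has |V|^2 nodes and |E|^2 edges, which makes the quadratic blow-up of any pairwise computation visible.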
37 3. SimRank and Random Walk
To solve the “infinite EMD” problem, find a mapping function f from EMD to the SimRank score.
EMD ranges up to infinity, while the SimRank score ranges up to 1.
There should be a negative correlation between EMD and SimRank.
Expected-f meeting distance, with f(z) = c^z: s'(a,b) = \sum_{t: (a,b) \leadsto (x,x)} P[t] \, c^{l(t)}
If a=b, s'(a,b)=1; if there is no path to any (x,x), s'(a,b)=0
At this point we still have no relationship between s and s'. Also, is the c here the same as the parameter c of the SimRank score?
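One way to get a feel for the expected-f meeting distance with f(z) = c^z is to estimate it by simulation. The sketch below assumes that both surfers step backward, each to a uniformly random in-neighbor, in lockstep, which is the setting in which the quantity is compared with SimRank. The function name, the truncation at `max_steps`, and the fixed random seed are hypothetical choices for illustration.

```python
import random


def estimate_meeting_score(edges, a, b, c=0.8, walks=10000, max_steps=100, seed=0):
    """Monte Carlo estimate of the sum over tours t of P[t] * c**l(t).

    Two surfers start at a and b and step backward (to a uniformly random
    in-neighbor) in lockstep. A walk contributes c**k if the surfers first
    meet after k steps, and 0 if they get stuck or do not meet within
    max_steps.
    """
    in_nbrs = {}
    for src, dst in edges:
        in_nbrs.setdefault(dst, []).append(src)

    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        x, y = a, b
        for step in range(max_steps + 1):
            if x == y:
                total += c ** step  # first meeting after `step` steps
                break
            if x not in in_nbrs or y not in in_nbrs:
                break  # a surfer has no in-neighbor: no tour to a singleton
            x = rng.choice(in_nbrs[x])
            y = rng.choice(in_nbrs[y])
    return total / walks
```

Truncating at `max_steps` drops tours longer than that, which changes the estimate by at most c**max_steps, so the cutoff matters little when c is well below 1.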
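38 3. SimRank and Random Walk
Same idea as the recursive description of d(u,v) in G.
First step: move from (a,b) to one of its out-neighbor pairs O_z((a,b)).
Suppose t' is the remaining tour from O_z((a,b)) to (x,x); then l(t) = l(t') + 1.
Expected-f meeting distance: s'(a,b) = \frac{c}{|O((a,b))|} \sum_{z=1}^{|O((a,b))|} s'(O_z((a,b)))
At this point we can see that s' satisfies the same recursive equation as the SimRank score s when the surfers follow edges backward (so that the out-neighbors of the walk are the in-neighbors in G), hence s'(a,b) = s(a,b).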