1
SimRank: A Measure of Structural-Context Similarity
Glen Jeh & Jennifer Widom, KDD 2002
2
Many applications require a measure of "similarity" between objects:
- Web search
- Shopping recommendations
- Search for "related work" among scientific papers
But "similarity" may be domain-dependent. Can we define a generic model for similarity?
3
What do all these applications have in common? A data set of objects linked by a set of relations. A generic notion of similarity over such data is structural-context similarity: "Two objects are similar if they relate to similar objects." Recall automorphic equivalence: "Two objects are equivalent if they relate to equivalent objects."
4
Given a graph G = (V, E), for each pair of vertices a,b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.
5
Directed graph G = (V, E):
- V = set of objects
- E = set of unweighted edges; edge (u,v) exists if there is a relation u → v
- I(v) = set of in-neighbors of vertex v
- O(v) = set of out-neighbors of vertex v
6
Recursive model: "Two objects are similar if they are referenced by similar objects." That is, a ~ b if c → a and d → b, and c ~ d. An object is maximally similar to itself (score = 1).
Example:
1. ProfA ~ ProfB because both are referenced by Univ.
2. StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}.
7
s(a,b) = similarity between a and b = (scaled) average similarity between the in-neighbors of a and the in-neighbors of b. s(a,b) is in the range [0, 1].
- If a = b, then s(a,b) = 1.
- If a ≠ b: s(a,b) = (C / (|I(a)| |I(b)|)) Σ_{i ∈ I(a)} Σ_{j ∈ I(b)} s(i,j), where C is a constant, 0 < C < 1.
- If I(a) = ∅ or I(b) = ∅, then s(a,b) = 0.
A small worked example follows below.
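To make the recursion concrete, here is a worked example, assuming for illustration that the example graph from the previous slide contains exactly the edges Univ → ProfA, Univ → ProfB, ProfA → StudentA, ProfB → StudentB (the exact edge set is an assumption):

s(ProfA, ProfB) = (C / (1·1)) · s(Univ, Univ) = C
s(StudentA, StudentB) = (C / (1·1)) · s(ProfA, ProfB) = C²

With C = 0.8, the professors score 0.8 and the students score 0.64: similarity decays with distance from the common source.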
8
x is identical to itself: s(x,x) = 1. Since we have x → a and x → b, should s(a,b) = 1 also? If the graph represented all the information about x, a, and b, then s(a,b) would ideally be 1. But in reality the graph does not describe everything about them, so we expect s(a,b) < 1. Therefore the constant C expresses our limited confidence, or decay with distance: s(a,b) = C ∙ average similarity over (I(a), I(b)).
[Figure: a single node x with edges x → a and x → b]
9
Given graph G, define the node-pair graph G² = (V², E²), where V² = V × V. Each vertex in V² is a pair of vertices of V. E²: (a,b) → (c,d) in G² iff a → c and b → d in G. Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex. A construction sketch follows below.
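A minimal sketch of the pair-graph construction, assuming G is given as a dict mapping each vertex to a list of its out-neighbors (the input representation is an assumption for illustration, not from the paper):

```python
from itertools import product

def pair_graph(out_neighbors):
    """Build the node-pair graph G^2 = (V^2, E^2) from G.

    out_neighbors: dict mapping each vertex to a list of its out-neighbors.
    Returns a dict mapping each unordered vertex pair (stored as a sorted
    tuple, since s(a,b) = s(b,a)) to the set of pairs it points to in G^2.
    """
    vertices = list(out_neighbors)
    edges2 = {}
    for a, b in product(vertices, repeat=2):
        src = tuple(sorted((a, b)))
        dsts = edges2.setdefault(src, set())
        # (a,b) -> (c,d) in G^2 iff a -> c and b -> d in G
        for c, d in product(out_neighbors[a], out_neighbors[b]):
            dsts.add(tuple(sorted((c, d))))
    return edges2
```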
10
The SimRank score for a vertex (a,b) in G² = the similarity between a and b in G. The source of similarity is the self-vertices, like (Univ, Univ). Similarity then propagates along pair-paths in G², away from the sources. Note that values decrease with distance from (Univ, Univ).
11
Bipartite graph: two types of objects. Example: buyers and items.
12
Two types of similarity arise:
- Two buyers are similar if they buy similar items; the out-neighbors of the buyers are relevant.
- Two items are similar if they are bought by similar buyers; the in-neighbors of the items are relevant.
In general, we can use I(·) and/or O(·) for any graph. The bipartite equations are written out below.
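For reference, the bipartite form of the recursion, written out here from the definitions above (the decay constants C1 and C2 are the two constants referred to later in the experiments; treat this as a sketch rather than a quotation of the paper's equations):

```latex
% Buyers A, B are compared through what they buy (out-neighbors);
% items c, d are compared through who buys them (in-neighbors).
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i \in O(A)} \sum_{j \in O(B)} s(i,j) \quad (A \neq B)
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i \in I(c)} \sum_{j \in I(d)} s(i,j) \quad (c \neq d)
```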
13
Motivation: two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1}. SimRank compares each course of A with each course of B, but intuitively we just want the best matching pairs: s(Eng1_A, Eng1_B), s(Math1_A, Math1_B), etc.
Solution: two steps (a sketch follows below).
- Max: pair each neighbor of A with only its most similar neighbor of B; do the same in the other direction, giving s_A(A,B) and s_B(A,B).
- Min: the final s(A,B) is the smaller of s_A(A,B) and s_B(A,B) [weakest link].
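A minimal sketch of the min-max matching step, assuming the pairwise neighbor similarities are already available in a dict sim keyed by vertex pairs (the input representation, helper names, and the placement of the decay factor C are assumptions for illustration):

```python
def minimax_score(neighbors_a, neighbors_b, sim, C=0.8):
    """Min-max SimRank variant: match each neighbor to its best counterpart,
    average in each direction, and keep the weaker direction."""
    if not neighbors_a or not neighbors_b:
        return 0.0

    def directed(from_side, to_side):
        # Max step: each neighbor on one side is paired only with its
        # most similar neighbor on the other side.
        best = [max(sim.get((u, v), sim.get((v, u), 0.0)) for v in to_side)
                for u in from_side]
        return C * sum(best) / len(best)

    # Min step: the final score is the weaker of the two directions.
    return min(directed(neighbors_a, neighbors_b),
               directed(neighbors_b, neighbors_a))
```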
14
R_k(a,b) = estimate of SimRank after k iterations.
Initialization: R_0(a,b) = 1 if a = b, and 0 otherwise.
Iteration: R_{k+1}(a,b) = (C / (|I(a)| |I(b)|)) Σ_{i ∈ I(a)} Σ_{j ∈ I(b)} R_k(i,j) for a ≠ b, and R_{k+1}(a,a) = 1.
R_k(a,b) is the similarity that has flowed a distance k away from the sources. R_k values are non-decreasing as k increases. We can prove that R_k(a,b) converges to s(a,b). A runnable sketch of this iteration appears below.
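A minimal, unoptimized sketch of this fixed-point iteration, assuming the graph is supplied as a dict mapping each vertex to a list of its in-neighbors (the input format and parameter defaults are assumptions):

```python
from itertools import product

def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors: dict mapping each vertex to a list of its in-neighbors.
    Returns a dict with R_k(a, b) for every ordered vertex pair.
    """
    vertices = list(in_neighbors)
    # R_0(a,b) = 1 if a == b else 0
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(vertices, repeat=2)}

    for _ in range(iterations):
        R_next = {}
        for a, b in product(vertices, repeat=2):
            if a == b:
                R_next[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                R_next[(a, b)] = 0.0  # s(a,b) = 0 if I(a) or I(b) is empty
                continue
            total = sum(R[(i, j)] for i, j in product(Ia, Ib))
            R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R
```

On the illustrative edge set assumed earlier (Univ → ProfA, Univ → ProfB, ProfA → StudentA, ProfB → StudentB) this reproduces the worked example:

```python
in_nbrs = {"Univ": [], "ProfA": ["Univ"], "ProfB": ["Univ"],
           "StudentA": ["ProfA"], "StudentB": ["ProfB"]}
scores = simrank(in_nbrs)
print(round(scores[("ProfA", "ProfB")], 2))        # 0.8  (= C)
print(round(scores[("StudentA", "StudentB")], 2))  # 0.64 (= C^2)
```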
15
Space complexity: O(n²) to store R_k(a,b).
Time complexity: O(K n² d²), where d² is the average of |I(a)| |I(b)| over all vertex pairs (a,b).
To improve performance, we can prune G². Idea: vertices that are far apart should have very low similarity, which we can approximate as 0. Select a radius r; if a vertex pair (a,b) cannot meet in fewer than r steps, remove it from G² (one way to implement this is sketched below).
- Space complexity: O(n d_r)
- Time complexity: O(K n d_r d²), where d_r = average number of neighbors within radius r.
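One way to read "can meet within r steps" is that some vertex x reaches both a and b by directed paths of at most r edges. A rough sketch of pruning under that reading (the interpretation, helper names, and quadratic pair enumeration are assumptions for illustration only):

```python
from collections import deque
from itertools import product

def within_radius(out_neighbors, r):
    """For each vertex, the set of vertices reachable in at most r steps (BFS)."""
    reach = {}
    for src in out_neighbors:
        seen = {src}
        frontier = deque([(src, 0)])
        while frontier:
            u, dist = frontier.popleft()
            if dist == r:
                continue
            for v in out_neighbors[u]:
                if v not in seen:
                    seen.add(v)
                    frontier.append((v, dist + 1))
        reach[src] = seen
    return reach

def kept_pairs(out_neighbors, r):
    """Vertex pairs retained in the pruned G^2: pairs (a, b) that share at
    least one common source x with paths x -> ... -> a and x -> ... -> b of
    length at most r. All other pairs get an approximate score of 0."""
    reach = within_radius(out_neighbors, r)
    kept = set()
    for x in out_neighbors:
        for a, b in product(reach[x], repeat=2):
            kept.add((a, b))
    return kept
```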
16
SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards.
Background: basic forward random walk. Motion is in discrete steps, using the edges of the graph. At each time step, there is an equal probability of moving from the current vertex to any one of its out-neighbors. Given adjacency matrix A, the probability of walking from x to y is p_xy = a_xy / |O(x)|.
Random walk as a Markov process: the initial location is described by the probability distribution vector π^(0). The probability of being at y at time 1 is π^(1)(y) = Σ_x π^(0)(x) p_xy, i.e. π^(1) = π^(0) P.
17
Given adjacency matrix A, the forward and backward transition matrices are P, with p_xy = a_xy / |O(x)|, and Q, with q_xy = a_yx / |I(x)| (walk an edge backwards, chosen uniformly among the in-neighbors). A sketch of both appears below.
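A small sketch of the two matrices, assuming A is a dense 0/1 NumPy array (the representation and the all-zero treatment of zero-degree rows are assumptions for illustration):

```python
import numpy as np

def transition_matrices(A):
    """Forward and backward transition matrices from a 0/1 adjacency matrix.

    Forward:  P[x, y] = A[x, y] / out_degree(x)  (uniform over out-neighbors)
    Backward: Q[x, y] = A[y, x] / in_degree(x)   (uniform over in-neighbors)
    Rows with degree 0 are left as all zeros.
    """
    out_deg = A.sum(axis=1, keepdims=True)
    in_deg = A.sum(axis=0, keepdims=True).T
    P = np.divide(A, out_deg, out=np.zeros_like(A, dtype=float), where=out_deg > 0)
    Q = np.divide(A.T, in_deg, out=np.zeros_like(A, dtype=float), where=in_deg > 0)
    return P, Q

# One step of the forward walk from an initial distribution pi0:
# P, Q = transition_matrices(A); pi1 = pi0 @ P
```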
18
Probability of walking backwards from a to x in one step: p(x → a) = a_xa / |I(a)|. Two walkers meet at x if they start at a and b and, in one backward step, one traverses the edge x → a and the other traverses the edge x → b. Then
s_x(a,b) = P(meeting at x) = π(a,b) · p(x → a) · p(x → b)
s(a,b) = P(meeting) = Σ_x π(a,b) · p(x → a) · p(x → b)
If they start together, they have already met, so s^(0)(x,y) = 1 if x = y and 0 otherwise [the identity matrix]. The full scores are then built up iteratively from this base case, as with R_k earlier. A Monte-Carlo check of the random-surfer view is sketched below.
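The expected-meeting view can be checked empirically. Below is a rough Monte-Carlo sketch, under the assumption that s(a,b) corresponds to the expectation of C^L, where L is the number of lockstep backward steps until the two surfers first meet (this reading, the step cutoff, and the handling of dead ends are assumptions of the sketch):

```python
import random

def meeting_estimate(in_neighbors, a, b, C=0.8, trials=10000, max_steps=50):
    """Monte-Carlo estimate of E[C^L] for two surfers walking backwards in
    lockstep from a and b; walks that hit a dead end or exceed max_steps
    contribute 0 (they never meet)."""
    total = 0.0
    for _ in range(trials):
        x, y = a, b
        steps = 0
        while steps <= max_steps:
            if x == y:
                total += C ** steps
                break
            if not in_neighbors[x] or not in_neighbors[y]:
                break  # dead end: the surfers can never meet
            x = random.choice(in_neighbors[x])
            y = random.choice(in_neighbors[y])
            steps += 1
    return total / trials
```

On the small illustrative graph used earlier, meeting_estimate(in_nbrs, "ProfA", "ProfB") comes out near 0.8, matching the fixed-point value.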
19
Two data sets:
- ResearchIndex (www.researchindex.com): a corpus of scientific research papers; 688,898 cross-references among 278,628 papers.
- Students' transcripts: 1,030 undergraduate students in the School of Engineering at Stanford University. Each transcript lists all courses the student has taken so far (average: 40 courses/student).
20
Problem: it is difficult to know the "correct" similarity between items.
Solution: define a rough domain-specific metric σ(p,q).
For scientific papers, two versions:
- σ_C(p,q) = fraction of q's citations also cited by p
- σ_T(p,q) = fraction of words in q's title that also appear in p's title
For university courses:
- σ_D(p,q) = 1 if p and q are in the same department, else 0
21
Run the similarity algorithms: SimRank (naïve, pruned, min-max) and co-citation. For each object p and algorithm A, form the set top_{A,N}(p) of the N objects most similar to p. For each q ∈ top_{A,N}(p), compute σ(p,q), and return the average σ_{A,N}(p) over all q. A sketch of this evaluation loop follows below.
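A rough sketch of the evaluation loop, assuming the per-algorithm top-N lists and the σ metric are supplied as callables (all function names and data structures here are illustrative, not from the paper):

```python
def average_sigma(objects, top_n_similar, sigma, N=10):
    """For each object p, take its N most similar objects under some
    algorithm, average sigma(p, q) over them, then average over all p."""
    per_object = []
    for p in objects:
        top = top_n_similar(p, N)  # the N objects most similar to p
        if not top:
            continue
        per_object.append(sum(sigma(p, q) for q in top) / len(top))
    return sum(per_object) / len(per_object)

def sigma_citation(citations, p, q):
    """sigma_C(p, q): fraction of q's citations that are also cited by p.
    citations: dict mapping each paper to the set of papers it cites."""
    if not citations[q]:
        return 0.0
    return len(citations[q] & citations[p]) / len(citations[q])
```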
22
Setup: used bipartite SimRank, considering only in-neighbors (validation uses out-neighbors); N ∈ {5, 10, …, 45, 50}.
Results: not very sensitive to the decay factors C1 and C2; pruning the search radius had little effect on the rank order of the scores.
24
Setup: bipartite domain; N ∈ {5, 10}.
Results: the min-max version of SimRank performed the best; not very sensitive to the decay factors C1 and C2.
25
Co-citation scores are very poor (0.161 for N=5 and 0.147 for N=10), so they are not shown in the graph.
26
- Defined a recursive model of structural similarity between objects in a network
- Mathematically formulated SimRank based on the recursive concept
- Presented a convergent algorithm to compute SimRank
- Described a random-walk interpretation of the SimRank equations and scores
- Experimentally validated SimRank over two real data sets
27
- O(n²) space is large; scalability needs to be improved.
- s(a,b) only includes contributions from paths where a and b are the same distance from some x. What if the distances are offset (the total path length is odd)?
- As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)! This is partially addressed by the min-max method.