
1 SimRank: A Measure of Structural-Context Similarity (Glen Jeh & Jennifer Widom, KDD 2002)

2
- Many applications require a measure of “similarity” between objects:
  - Web search
  - Shopping recommendations
  - Search for “related work” among scientific papers
- But “similarity” may be domain-dependent.
- Can we define a generic model for similarity?

3
- What do all these applications have in common?
  - A data set of objects linked by a set of relations.
- A generic concept of similarity, then, is structural-context similarity:
  - “Two objects are similar if they relate to similar objects.”
- Recall automorphic equivalence:
  - “Two objects are equivalent if they relate to equivalent objects.”

4
- Given a graph G = (V, E), for each pair of vertices a, b ∈ V, compute a similarity (ranking) score s(a,b) based on the concept of structural-context similarity.

5
- Directed graph G = (V, E)
  - V = set of objects
  - E = set of unweighted edges
  - Edge (u,v) exists if there is a relation u → v
- I(v) = set of in-neighbors of vertex v
- O(v) = set of out-neighbors of vertex v
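
A minimal sketch (not from the original slides) of this graph model in Python; the toy edge list reuses the Univ/Prof/Student example from the next slide.

```python
from collections import defaultdict

# Toy directed graph as an edge list (illustrative data, not from the paper).
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]

# I(v) = in-neighbors, O(v) = out-neighbors, as defined on the slide.
I = defaultdict(set)
O = defaultdict(set)
for u, v in edges:
    O[u].add(v)
    I[v].add(u)

print(I["StudentA"])  # {'ProfA'}
print(O["Univ"])      # {'ProfA', 'ProfB'}
```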

6
- Recursive model: “Two objects are similar if they are referenced by similar objects.”
- That is, a ~ b if c → a and d → b, and c ~ d.
- An object is equivalent to itself (score = 1).
- Example:
  1. ProfA ~ ProfB because both are referenced by Univ.
  2. StudentA ~ StudentB because they are referenced by similar nodes {ProfA, ProfB}.

7
- s(a,b) = similarity between a and b = average similarity between the in-neighbors of a and the in-neighbors of b
- s(a,b) is in the range [0, 1]
- If a = b, then s(a,b) = 1
- If a ≠ b:
  s(a,b) = (C / (|I(a)| |I(b)|)) · Σ_{i ∈ I(a)} Σ_{j ∈ I(b)} s(i, j)
  where C is a constant, 0 < C < 1
- If I(a) or I(b) = ∅, then s(a,b) = 0

8
- x is identical to itself: s(x,x) = 1.
- Since we have x → a and x → b, should s(a,b) = 1 also?
- If the graph represented all the information about x, a, and b, then s(a,b) would ideally be 1.
- But in reality the graph does not describe everything about them, so we expect s(a,b) < 1.
- Therefore, the constant C expresses our limited confidence, or decay with distance:
  s(a,b) = C · average similarity of (I(a), I(b))
[Figure: a small graph with edges x → a and x → b]

9
- Given graph G, define G² = (V², E²) where:
  - V² = V × V; each vertex of G² is a pair of vertices of G.
  - E²: (a,b) → (c,d) is an edge in G² iff a → c and b → d in G.
- Since similarity scores are symmetric, (a,b) and (b,a) are merged into a single vertex.
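
A minimal sketch (my own illustration, not from the slides) of the pair-graph construction, reusing the earlier toy edge list:

```python
from itertools import product

# Toy directed graph (same made-up example as before).
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
nodes = sorted({n for e in edges for n in e})

# Merge (a,b) and (b,a) into one vertex by storing each pair in sorted order.
pair = lambda a, b: tuple(sorted((a, b)))

V2 = {pair(a, b) for a, b in product(nodes, nodes)}
# (a,b) -> (c,d) is an edge in G^2 iff a -> c and b -> d in G.
E2 = {(pair(a, b), pair(c, d))
      for (a, c), (b, d) in product(edges, edges)}

print(len(V2), len(E2))
print((("ProfA", "ProfB"), ("StudentA", "StudentB")) in E2)  # True
```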

10
- The SimRank score of a vertex (a,b) in G² is the similarity between a and b in G.
- The source of similarity is the self-vertices, such as (Univ, Univ).
- Similarity then propagates along pair-paths in G², away from the sources.
- Note that the values decrease with distance from (Univ, Univ).

11
- Bipartite: two types of objects
- Example: buyers and items

12
- Two types of similarity:
  - Two buyers are similar if they buy similar items. Out-neighbors of buyers are relevant:
    s(A,B) = (C1 / (|O(A)| |O(B)|)) · Σ_{i ∈ O(A)} Σ_{j ∈ O(B)} s(i, j)
  - Two items are similar if they are bought by similar buyers. In-neighbors of items are relevant:
    s(c,d) = (C2 / (|I(c)| |I(d)|)) · Σ_{i ∈ I(c)} Σ_{j ∈ I(d)} s(i, j)
- In general, we can use I(·) and/or O(·) for any graph.
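
A minimal sketch of the bipartite iteration, assuming a made-up buyer/item purchase list and assumed decay values; it is not code from the paper.

```python
from collections import defaultdict
from itertools import product

# Made-up bipartite purchase data: (buyer, item) edges.
purchases = [("Alice", "camera"), ("Alice", "lens"),
             ("Bob", "camera"), ("Bob", "tripod")]
buyers = sorted({b for b, _ in purchases})
items = sorted({i for _, i in purchases})

O = defaultdict(set)  # items bought by a buyer (out-neighbors)
I = defaultdict(set)  # buyers of an item (in-neighbors)
for b, i in purchases:
    O[b].add(i)
    I[i].add(b)

C1 = C2 = 0.8  # assumed decay factors; the paper's exact values are not given here
s_b = {(a, b): 1.0 if a == b else 0.0 for a, b in product(buyers, buyers)}
s_i = {(c, d): 1.0 if c == d else 0.0 for c, d in product(items, items)}

for _ in range(5):  # a few iterations suffice for this toy example
    new_b, new_i = {}, {}
    for a, b in product(buyers, buyers):
        if a == b:
            new_b[(a, b)] = 1.0
        elif not O[a] or not O[b]:
            new_b[(a, b)] = 0.0
        else:  # buyer similarity averages item similarities over O(a) x O(b)
            total = sum(s_i[(i, j)] for i in O[a] for j in O[b])
            new_b[(a, b)] = C1 * total / (len(O[a]) * len(O[b]))
    for c, d in product(items, items):
        if c == d:
            new_i[(c, d)] = 1.0
        elif not I[c] or not I[d]:
            new_i[(c, d)] = 0.0
        else:  # item similarity averages buyer similarities over I(c) x I(d)
            total = sum(s_b[(x, y)] for x in I[c] for y in I[d])
            new_i[(c, d)] = C2 * total / (len(I[c]) * len(I[d]))
    s_b, s_i = new_b, new_i

print(round(s_b[("Alice", "Bob")], 3))  # similarity of the two buyers
```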

13
- Motivation: two students A and B take the same courses: {Eng1, Math1, Chem1, Hist1}.
- SimRank compares each course of A with each course of B.
- But intuitively we just want the best matching pairs: s(Eng1_A, Eng1_B), s(Math1_A, Math1_B), etc.
- Solution, in two steps (see the sketch after this slide):
  - Max: pair each neighbor of A with only its most similar neighbor of B, giving s_A(A,B). Do the same in the other direction to get s_B(A,B).
  - Min: the final s(A,B) is the smaller of s_A(A,B) and s_B(A,B) [weakest link].
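
A minimal Python sketch of the min-max idea described above; the per-direction averaging of best-match scores is my reading of the slide, not code from the paper, and the course data is the toy example from the slide.

```python
def minmax_simrank_update(s, N, A, B, C=0.8):
    """One min-max update for the pair (A, B), following the slide's two steps.

    s : dict mapping (neighbor_i, neighbor_j) -> current similarity score
    N : dict mapping a node to its relevant neighbor set (e.g. courses taken)
    Averaging the best-match scores in each direction is an assumption of this
    sketch; the paper's exact formulation may differ in detail.
    """
    if A == B:
        return 1.0
    if not N[A] or not N[B]:
        return 0.0
    # Max step: each neighbor of A is paired only with its best match among B's neighbors.
    s_AB = C * sum(max(s[(i, j)] for j in N[B]) for i in N[A]) / len(N[A])
    # ... and symmetrically for B's neighbors against A's.
    s_BA = C * sum(max(s[(i, j)] for i in N[A]) for j in N[B]) / len(N[B])
    # Min step: keep the weaker of the two directions ("weakest link").
    return min(s_AB, s_BA)


# Toy example: two students sharing all four courses (identical courses score 1,
# different courses 0 at this point of the iteration).
courses = ["Eng1", "Math1", "Chem1", "Hist1"]
N = {"A": set(courses), "B": set(courses)}
s = {(c, d): 1.0 if c == d else 0.0 for c in courses for d in courses}
print(minmax_simrank_update(s, N, "A", "B"))  # 0.8 = C * (perfect matching)
```

On the same data, plain SimRank would average over all 16 course pairs and give only C · 4/16 = 0.2, which is exactly the problem the slide's motivation points out.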

14
- R_k(a,b) = estimate of SimRank after k iterations.
- Initialization: R_0(a,b) = 1 if a = b, 0 otherwise.
- Iteration: R_{k+1}(a,b) = (C / (|I(a)| |I(b)|)) · Σ_{i ∈ I(a)} Σ_{j ∈ I(b)} R_k(i, j) for a ≠ b, and R_{k+1}(a,a) = 1.
- R_k(a,b) is the similarity that has flowed a distance k away from the sources. R_k values are non-decreasing as k increases.
- We can prove that R_k(a,b) converges to s(a,b).
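
A minimal, runnable sketch of this fixed-point iteration (my own illustration, not the authors' code), on the toy Univ/Prof/Student graph from slide 6:

```python
from collections import defaultdict
from itertools import product

def simrank(edges, C=0.8, iters=10):
    """Naive SimRank iteration R_k -> R_{k+1}, as defined on the slide."""
    nodes = sorted({n for e in edges for n in e})
    I = defaultdict(set)                      # in-neighbor sets
    for u, v in edges:
        I[v].add(u)

    # R_0(a,b) = 1 if a == b else 0
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        Rn = {}
        for a, b in product(nodes, nodes):
            if a == b:
                Rn[(a, b)] = 1.0
            elif not I[a] or not I[b]:
                Rn[(a, b)] = 0.0
            else:
                total = sum(R[(i, j)] for i in I[a] for j in I[b])
                Rn[(a, b)] = C * total / (len(I[a]) * len(I[b]))
        R = Rn
    return R

# Toy graph from slide 6.
edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
R = simrank(edges)
print(round(R[("ProfA", "ProfB")], 3))        # 0.8  (= C, shared parent Univ)
print(round(R[("StudentA", "StudentB")], 3))  # 0.64 (= C * s(ProfA, ProfB))
```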

15
- Space complexity: O(n²) to store R_k(a,b).
- Time complexity: O(K n² d²), where d² is the average of |I(a)| |I(b)| over all vertex pairs (a,b).
- To improve performance, we can prune G² (see the sketch after this slide):
  - Idea: vertices that are far apart should have very low similarity; we can approximate it as 0.
  - Select a radius r. If the vertex pair (a,b) cannot meet in fewer than r steps, remove it from G².
  - Pruned space complexity: O(n d_r), where d_r = average number of neighbors within radius r.
  - Pruned time complexity: O(K n d_r d²).
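
A rough sketch (my own, with assumed helper names) of the radius-based pruning: a pair (a,b) is kept only if some node x can reach both a and b within r edges, i.e. two backward walkers starting at a and b could meet within radius r. The exact meeting condition used in the paper may be stricter.

```python
from collections import defaultdict, deque
from itertools import combinations

def back_reachable(I, start, r):
    """Nodes reachable from `start` in at most r backward steps (via in-edges)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == r:
            continue
        for prev in I[node]:
            if prev not in seen:
                seen.add(prev)
                frontier.append((prev, d + 1))
    return seen

def prune_pairs(edges, r):
    """Keep only vertex pairs whose backward walkers could meet within radius r."""
    nodes = sorted({n for e in edges for n in e})
    I = defaultdict(set)
    for u, v in edges:
        I[v].add(u)
    reach = {v: back_reachable(I, v, r) for v in nodes}
    return {(a, b) for a, b in combinations(nodes, 2) if reach[a] & reach[b]}

edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
print(("StudentA", "StudentB") in prune_pairs(edges, r=1))  # False: they only meet 2 steps back
print(("StudentA", "StudentB") in prune_pairs(edges, r=2))  # True
```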

16
- SimRank s(a,b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and randomly walk the graph backwards.
- Background: basic forward random walk
  - Motion is in discrete steps, using the edges of the graph.
  - At each time step, there is an equal probability of moving from the current vertex to any one of its out-neighbors.
  - Given adjacency matrix A, the probability of walking from x to y is p_xy = a_xy / |O(x)|.
- Random walk as a Markov process
  - The initial location is described by the probability distribution vector π^(0).
  - Probability of being at y at time 1: π^(1)_y = Σ_x π^(0)_x · p_xy, i.e. π^(1) = π^(0) P.

17
- Given adjacency matrix A (the example matrix on the slide is not reproduced in this transcript).
- The forward and backward transition matrices:
  - Forward: p_xy = a_xy / |O(x)| (step from x to a uniformly chosen out-neighbor).
  - Backward: q_xy = a_yx / |I(x)| (step from x to a uniformly chosen in-neighbor).
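
A small numpy sketch (my own illustration) of the forward and backward transition matrices; the node ordering and the example adjacency matrix are assumed, not the ones shown on the original slide.

```python
import numpy as np

# Toy adjacency matrix for nodes [Univ, ProfA, ProfB, StudentA, StudentB];
# a_xy = 1 iff there is an edge x -> y (assumed example data).
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]], dtype=float)

out_deg = A.sum(axis=1)   # |O(x)| for each row x
in_deg = A.sum(axis=0)    # |I(x)| for each column x

# Forward walk: p_xy = a_xy / |O(x)| (rows with no out-edges stay all zero).
P = np.divide(A, out_deg[:, None], out=np.zeros_like(A), where=out_deg[:, None] > 0)

# Backward walk: q_xy = a_yx / |I(x)| (rows with no in-edges stay all zero).
Q = np.divide(A.T, in_deg[:, None], out=np.zeros_like(A), where=in_deg[:, None] > 0)

pi0 = np.array([1.0, 0, 0, 0, 0])   # a surfer starting at Univ
print(pi0 @ P)                      # distribution after one forward step
print(Q[3])                         # backward-step distribution from StudentA
```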

18
- Probability of walking backwards from a to x in one step: p(x ← a) = a_xa / |I(a)|.
- Two walkers meet at x if they start at a and b and one walks backwards along the edge x → a while the other walks backwards along x → b:
  s_x(a,b) = P(meeting at x) = π(a,b) · p(x ← a) · p(x ← b), where π(a,b) is the probability of the walker pair being at (a,b)
  s(a,b) = P(meeting) = Σ_x π(a,b) · p(x ← a) · p(x ← b)
- If they start together, they have already met, so s^(0)_xy = 1 if x = y, 0 otherwise [the identity matrix].
- Then the scores can be iterated in matrix form using the backward transition matrix Q: S^(k+1) = C · Q S^(k) Qᵀ, with the diagonal reset to 1.
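
A Monte Carlo sketch (my own, not from the paper) of this random-surfer reading: two walkers start at a and b, each step moves both backwards to a uniformly chosen in-neighbor, and a trial contributes C^t if they first meet after t steps. The average over trials should approach the SimRank score; the paper's exact expected-meeting formulation may differ in detail.

```python
import random
from collections import defaultdict

def surfer_similarity(edges, a, b, C=0.8, trials=20000, max_steps=50, seed=0):
    """Monte Carlo estimate of E[C^tau], where tau is the first meeting time of
    two backward random surfers starting at a and b (0 if they never meet)."""
    rng = random.Random(seed)
    I = defaultdict(list)
    for u, v in edges:
        I[v].append(u)

    total = 0.0
    for _ in range(trials):
        x, y = a, b
        for step in range(1, max_steps + 1):
            if not I[x] or not I[y]:       # a walker is stuck: they never meet
                break
            x, y = rng.choice(I[x]), rng.choice(I[y])
            if x == y:
                total += C ** step         # met after `step` backward steps
                break
    return total / trials

edges = [("Univ", "ProfA"), ("Univ", "ProfB"),
         ("ProfA", "StudentA"), ("ProfB", "StudentB")]
print(surfer_similarity(edges, "ProfA", "ProfB"))        # ~0.8  (= C)
print(surfer_similarity(edges, "StudentA", "StudentB"))  # ~0.64 (= C^2)
```

These estimates match the iterative values computed earlier, which is the point of the random-walk interpretation.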

19
- Two data sets:
  - ResearchIndex (www.researchindex.com)
    - a corpus of scientific research papers
    - 688,898 cross-references among 278,628 papers
  - Student transcripts
    - 1030 undergraduate students in the School of Engineering at Stanford University
    - each transcript lists all courses that the student has taken so far (average: 40 courses per student)

20
- Problem: it is difficult to know the “correct” similarity between items.
- Solution: define a rough domain-specific metric σ(p,q).
  - For scientific papers, we have two versions:
    - σ_C(p,q) = fraction of q’s citations also cited by p
    - σ_T(p,q) = fraction of words in q’s title that also appear in p’s title
  - For university courses:
    - σ_D(p,q) = 1 if p and q are in the same department, else 0

21
- Run the similarity algorithms:
  - SimRank (naïve, pruned, min-max)
  - Co-citation
- For each object p and algorithm A, form the set top_{A,N}(p) of the N objects most similar to p.
- For each q ∈ top_{A,N}(p), compute σ(p,q).
- Return the average σ_{A,N}(p) over all q. (A small sketch of this evaluation loop follows.)
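
A compact sketch of the evaluation loop; `similarity_scores` and `sigma` are hypothetical placeholders for an algorithm's output and the domain metric σ, not functions from the paper.

```python
def average_sigma(p, similarity_scores, sigma, N=10):
    """Average sigma(p, q) over the N objects most similar to p.

    similarity_scores : dict mapping candidate object q -> score s(p, q)
                        produced by some algorithm A (hypothetical input).
    sigma             : callable sigma(p, q) -> float, the domain-specific metric.
    """
    top_N = sorted(similarity_scores, key=similarity_scores.get, reverse=True)[:N]
    return sum(sigma(p, q) for q in top_N) / len(top_N)

# Toy usage with made-up data: sigma_D(p, q) = 1 if same department else 0.
dept = {"CS101": "CS", "CS229": "CS", "EE101": "EE", "HIST1": "HIST"}
sigma_D = lambda p, q: 1.0 if dept[p] == dept[q] else 0.0
scores = {"CS229": 0.9, "EE101": 0.5, "HIST1": 0.1}
print(average_sigma("CS101", scores, sigma_D, N=2))  # 0.5
```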

22
- Setup
  - Used bipartite SimRank, only considering in-neighbors (validation uses out-neighbors)
  - N ∈ {5, 10, …, 45, 50}
- Results
  - Not very sensitive to the decay factors C1 and C2
  - Pruning the search radius had little effect on the rank order of scores

23 [Results graph; not reproduced in this transcript]

24
- Setup
  - Bipartite domain
  - N ∈ {5, 10}
- Results
  - The min-max version of SimRank performed the best
  - Not very sensitive to the decay factors C1 and C2

25 Co-citation scores are very poor (0.161 for N=5 and 0.147 for N=10), so they are not shown in the graph.

26
- Defined a recursive model of structural similarity between objects in a network
- Mathematically formulated SimRank based on this recursive concept
- Presented a convergent algorithm to compute SimRank
- Described a random-walk interpretation of the SimRank equations and scores
- Experimentally validated SimRank on two real data sets

27
- O(n²) space is large; scalability needs to be improved.
- s(a,b) only includes contributions from paths where a and b are the same distance from some node x. What if the distances are offset (i.e. the total path length is odd)?
- As |I(a)| and |I(b)| increase, SimRank decreases, even if I(a) = I(b)!
  - Addressed partially by the min-max method.

