Purnamrita Sarkar (Carnegie Mellon), Deepayan Chakrabarti (Yahoo! Research), Andrew W. Moore (Google, Inc.)
Which pair of nodes {i,j} should be connected? Variant: node i is given. Friend suggestion in Facebook: should Facebook suggest Alice to Bob as a future friend?
Which pair of nodes {i,j} should be connected? Variant: node i is given. Movie recommendation in Netflix: should Netflix suggest this movie to Alice? [Figure: Alice, Bob, Charlie]
Relevance search in databases: is paper #1 relevant to the query “SVM”? [Figure: graph of paper nodes (Paper #1, Paper #2) and word nodes (SVM, margin, maximum, classification, large scale), joined by paper-has-word and paper-cites-paper edges]
Classifying handwritten digits: are these two digits the same? (Zhu et al., 2003)
Link prediction problems rely on homophily: similar nodes are more likely to be connected. Use a graph-based proximity measure between the query node q and the other nodes, then predict a link between q and the highest-ranking node that is not already connected to it.
Predict a link between nodes:
With the minimum number of hops
With the most common neighbors (length-2 paths)
With the larger Adamic/Adar score
With more short paths (e.g. length-3 paths)
…
The Adamic/Adar score gives more weight to low-degree common neighbors. [Figure: Alice, Bob, Charlie. A prolific common friend with 1000 followers is less evidence; a less prolific common friend with 8 followers is much more evidence.]
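The two neighborhood heuristics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names (`build_adjacency`, `rank_candidates`) and the toy graph are hypothetical, and the Adamic/Adar score is taken in its usual form, a sum of 1/log(degree) over common neighbors.

```python
import math


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, i, j):
    """Number of length-2 paths between i and j."""
    return len(adj[i] & adj[j])


def adamic_adar(adj, i, j):
    """Sum of 1/log(degree) over common neighbors (low degree = more weight)."""
    return sum(1.0 / math.log(len(adj[k]))
               for k in adj[i] & adj[j] if len(adj[k]) > 1)


def rank_candidates(adj, q, score):
    """Rank nodes not yet linked to q by a proximity score, best first."""
    candidates = [v for v in adj if v != q and v not in adj[q]]
    return sorted(candidates, key=lambda v: score(adj, q, v), reverse=True)


# Toy graph (hypothetical): "hub" is a prolific common friend, "quiet" is not.
edges = [("alice", "hub"), ("alice", "quiet"), ("bob", "hub"), ("carol", "quiet")]
edges += [("hub", f"d{n}") for n in range(1, 6)]
adj = build_adjacency(edges)

print(common_neighbors(adj, "alice", "bob"), common_neighbors(adj, "alice", "carol"))
print(rank_candidates(adj, "alice", adamic_adar)[0])
```

In this toy graph Bob and Carol each share exactly one neighbor with Alice, so common neighbors cannot separate them, while Adamic/Adar prefers Carol because the shared neighbor has low degree.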
[Figure: link prediction accuracy* of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
How do we justify these observations, especially if the graph is sparse?
Link prediction problems rely on homophily: similar nodes are more likely to be connected. The different heuristics all try to predict this underlying, or “latent”, nearness of nodes. It is easier to encode this with a latent-space model for generating links.
Raftery et al.’s model: nodes are uniformly distributed in a latent space (a unit-volume universe). Points close in this space are more likely to be connected. The problem of link prediction is to find the nearest neighbor that is not currently linked to the node, which is equivalent to inferring distances in the latent space.
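Two sources of randomness:
Point positions: uniform in D-dimensional space
Linkage probability: logistic with parameters α and r
α, r and D are known.
[Figure: linkage probability against distance, with ticks at 1 and ½ and the radius r marked; smaller distances have a higher probability of linking, and α determines the steepness]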
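A minimal sketch of sampling a graph from such a model. The slide only says the linkage probability is logistic in α and r; the specific form 1/(1 + exp(α(d − r))), the unit hypercube as the unit-volume universe, and the function names are assumptions made for illustration.

```python
import math
import random


def link_probability(d, alpha, r):
    """Assumed logistic link function: 1/2 at d == r, steeper for larger alpha.

    Written via tanh, which equals 1 / (1 + exp(alpha * (d - r))) but cannot
    overflow for large alpha.
    """
    return 0.5 * (1.0 - math.tanh(0.5 * alpha * (d - r)))


def sample_graph(n, dim, alpha, r, seed=0):
    """Draw n points uniformly in the unit hypercube, then sample each edge."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if rng.random() < link_probability(d, alpha, r):
                edges.add((i, j))
    return points, edges


points, edges = sample_graph(n=50, dim=2, alpha=20.0, r=0.2)
print(len(points), len(edges))
```

Raising `alpha` pushes the link function toward a step at distance `r`, so nearly every pair closer than `r` is linked and nearly every farther pair is not.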
Compare the generative model with the link prediction heuristics: which is the most likely neighbor of node i, node a or node b?
A few properties of the model can justify the empirical observations.
We also offer some new prediction algorithms.
$\Pr_2(i,j) = \Pr(\text{common neighbor} \mid d_{ij})$: a product of two logistic probabilities, integrated over a volume determined by $d_{ij}$.
As $\alpha \to \infty$, the logistic becomes a step function, which is much easier to analyze!
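Written out, the statement above could look as follows. This is a sketch under assumptions: $x_k$ denotes the latent position of a third node k, boundary effects of the unit-volume universe are ignored, and $A(r, r, d_{ij})$ is used for the volume of intersection of two radius-r balls whose centers are $d_{ij}$ apart.

```latex
\Pr\nolimits_2(i,j)
  = \int \Pr(i \sim k \mid d_{ik}) \, \Pr(j \sim k \mid d_{jk}) \, dx_k
  \;\xrightarrow[\alpha \to \infty]{}\;
  \int \mathbf{1}[d_{ik} \le r] \, \mathbf{1}[d_{jk} \le r] \, dx_k
  = A(r, r, d_{ij})
```

In the step-function limit the probability of a common neighbor is a purely geometric quantity that shrinks as $d_{ij}$ grows, which is why counting common neighbors carries information about distance.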
Everyone has the same radius r. Empirical Bernstein bounds on the distance $d_{ij}$.
V(r) = volume of a ball of radius r in D dimensions
η = number of common neighbors
Unit-volume universe
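To illustrate how a common-neighbor count constrains distance, here is a small sketch that inverts the expected count for a point estimate of $d_{ij}$. It is not the empirical Bernstein bound from the slide: it assumes D = 2, step-function links, no boundary effects, and that the expected fraction of common neighbors equals the area where two radius-r disks overlap. The function names are hypothetical.

```python
import math


def lens_area(r, d):
    """Area of intersection of two radius-r disks whose centers are d apart."""
    if d >= 2 * r:
        return 0.0
    return (2 * r * r * math.acos(d / (2 * r))
            - 0.5 * d * math.sqrt(4 * r * r - d * d))


def estimate_distance(eta, n, r, tol=1e-9):
    """Solve lens_area(r, d) = eta / n for d by bisection (area decreases in d)."""
    target = eta / n
    if target >= lens_area(r, 0.0):
        return 0.0
    if target <= 0.0:
        return 2 * r
    lo, hi = 0.0, 2 * r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lens_area(r, mid) > target:
            lo = mid  # overlap still too large, so the distance must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


# More common neighbors imply a smaller estimated distance.
print(estimate_distance(50, 1000, 0.2), estimate_distance(10, 1000, 0.2))
```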
OPT = node closest to i; MAX = node with the most common neighbors with i.
Theorem: w.h.p., $d_{OPT} \le d_{MAX} \le d_{OPT} + 2\,[\varepsilon / V(1)]^{1/D}$, where $\varepsilon = c_1 (\mathrm{var}_N / N)^{1/2} + c_2 / (N-1)$ and D is the dimensionality.
Common neighbors is an asymptotically optimal heuristic as $N \to \infty$.
Node k has radius $r_k$. $i \to k$ if $d_{ik} \le r_k$ (directed graph). $r_k$ captures the popularity of node k.
Type 1: $i \leftarrow k \rightarrow j$, determined by $A(r_i, r_j, d_{ij})$
Type 2: $i \rightarrow k \leftarrow j$, determined by $A(r_k, r_k, d_{ij})$
Example graph: $N_1$ nodes of radius $r_1$ and $N_2$ nodes of radius $r_2$, with $r_1 \ll r_2$.
$\eta_1 \sim \mathrm{Bin}[N_1, A(r_1, r_1, d_{ij})]$, $\eta_2 \sim \mathrm{Bin}[N_2, A(r_2, r_2, d_{ij})]$
Maximize $\Pr[\eta_1, \eta_2 \mid d_{ij}]$, a product of two binomials:
$w(r_1)\,E[\eta_1 \mid d^*] + w(r_2)\,E[\eta_2 \mid d^*] = w(r_1)\,\eta_1 + w(r_2)\,\eta_2$
RHS ↑ ⇒ LHS ↑ ⇒ $d^*$ ↓
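[Figure: the weight w(r) against r, built from a Jacobian term and a variance term. In one regime the variance is small and the presence of a common neighbor is more surprising. When r is close to the max radius the variance is again small and absence is more surprising. Adamic/Adar corresponds to a weight of about 1/r; real-world graphs generally fall in this range.]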
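One way to see where the weighted-count equation and the Jacobian and variance terms could come from is to differentiate the log-likelihood of the two binomials. This is a sketch, assuming $\eta_1$ and $\eta_2$ are independent given the distance; it writes $d$ for $d_{ij}$, $A_k(d)$ for $A(r_k, r_k, d)$, and $A_k'(d)$ for its derivative in $d$.

```latex
\log \Pr[\eta_1, \eta_2 \mid d]
  = \sum_{k=1}^{2} \left[ \log \binom{N_k}{\eta_k}
    + \eta_k \log A_k(d) + (N_k - \eta_k) \log\bigl(1 - A_k(d)\bigr) \right]

\frac{\partial}{\partial d} \log \Pr[\eta_1, \eta_2 \mid d]
  = \sum_{k=1}^{2} \frac{A_k'(d)}{A_k(d)\bigl(1 - A_k(d)\bigr)}
    \bigl(\eta_k - N_k A_k(d)\bigr) = 0 \quad \text{at } d = d^*

w(r_k) := \frac{-A_k'(d^*)}{A_k(d^*)\bigl(1 - A_k(d^*)\bigr)}
\;\Longrightarrow\;
\sum_{k=1}^{2} w(r_k)\, E[\eta_k \mid d^*] = \sum_{k=1}^{2} w(r_k)\, \eta_k
```

Here $E[\eta_k \mid d^*] = N_k A_k(d^*)$, the numerator of $w(r_k)$ is the Jacobian of the overlap volume with respect to distance, and the denominator is the per-node Bernoulli variance, so radii with small variance receive large weight.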
$Q_r$ = fraction of nodes with radius ≤ r that are common neighbors
$T_R$ = fraction of nodes with radius ≥ R that are common neighbors
Large $Q_r$ ⇒ small $d_{ij}$; small $T_R$ ⇒ large $d_{ij}$
[Figure: number of common neighbors of a given radius, plotted against r]
Common neighbors = 2-hop paths. Analysis of longer paths has two components:
1. Bounding $E(\eta_\ell \mid d_{ij})$, where $\eta_\ell$ = number of ℓ-hop paths. Bound $\Pr_\ell(i,j)$ by using the triangle inequality on a series of common-neighbor probabilities (triangulation).
2. $\eta_\ell \approx E(\eta_\ell \mid d_{ij})$. The dependence of $\eta_\ell$ on the position of each node is bounded, so McDiarmid’s inequality can be used to bound $|\eta_\ell - E(\eta_\ell \mid d_{ij})|$.
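For reference, McDiarmid's inequality in its standard form is given below. The symbols $f$, $X_m$, $c_m$ and $t$ are generic; the slides do not state the bounded-difference constants for the path counts, so none are filled in here.

```latex
\text{If } X_1, \dots, X_n \text{ are independent and changing only } x_m
\text{ changes } f(x_1, \dots, x_n) \text{ by at most } c_m, \text{ then for all } t > 0:

\Pr\bigl( \lvert f(X_1, \dots, X_n) - E[f(X_1, \dots, X_n)] \rvert \ge t \bigr)
  \le 2 \exp\!\left( \frac{-2 t^2}{\sum_{m=1}^{n} c_m^2} \right)
```

In this setting $f$ would be the path count as a function of the independent node positions, and the bounded dependence on each position supplies the $c_m$.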
Bound $d_{ij}$ as a function of $\eta_\ell$ using McDiarmid’s inequality.
For $\ell' \ge \ell$ we need $\eta_{\ell'} \gg \eta_\ell$ to obtain similar bounds.
Also, we can obtain much tighter bounds for long paths if shorter paths are known to exist.
Logistic link function: a factor of ¼ gives a weak bound. It can be made tighter as the logistic approaches the step function.
Three key ingredients
1. Closer points are likelier to be linked. (Small-world model: Watts & Strogatz, 1998; Kleinberg)
2. The triangle inequality holds. This is necessary to extend the analysis to ℓ-hop paths.
3. Points are spread uniformly at random. Otherwise the properties would depend on location as well as distance.
Link prediction accuracy (Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007), for Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths:
The number of paths matters, not the length.
For large dense graphs, common neighbors are enough.
Differentiating between different degrees is important.
In sparse graphs, paths of length 3 or more help in prediction.
Combine bounds from different radii. But there might not be enough data to obtain individual bounds from each radius.
New sweep estimator:
$Q_r$ = fraction of nodes with radius ≤ r that are common neighbors. Larger $Q_r$ ⇒ smaller $d_{ij}$ w.h.p.
$T_R$ = fraction of nodes with radius ≥ R that are common neighbors. Smaller $T_R$ ⇒ larger $d_{ij}$ w.h.p.
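The two pooled fractions can be computed directly, as in the sketch below. It covers only the statistics $Q_r$ and $T_R$, not the bounds derived from them or the sweep over thresholds. The function name, the input layout (one radius and one common-neighbor flag per node), and the example values are hypothetical.

```python
def sweep_fractions(radii, is_common_neighbor, r, R):
    """Return (Q_r, T_R) for a pair (i, j).

    radii[k] is the radius of node k; is_common_neighbor[k] says whether
    node k is a common neighbor of i and j. A fraction is None when no
    node falls in its radius range.
    """
    small = [c for rad, c in zip(radii, is_common_neighbor) if rad <= r]
    large = [c for rad, c in zip(radii, is_common_neighbor) if rad >= R]
    q_r = sum(small) / len(small) if small else None
    t_R = sum(large) / len(large) if large else None
    return q_r, t_R


# Hypothetical data: five candidate nodes with their radii and flags.
radii = [0.1, 0.1, 0.2, 0.5, 0.5]
flags = [True, False, True, True, True]
print(sweep_fractions(radii, flags, r=0.1, R=0.5))
```

Pooling all nodes below (or above) a threshold is what lets the estimator use radii that individually have too few nodes for a bound of their own.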