1
Link Prediction & Content
Seminar: Social Media Mining
University: UC3M
Date: May 2017
Lecturer: Carlos Castillo
Sources:
Lilian Weng, Jacob Ratkiewicz, Nicola Perra, Bruno Gonçalves, Carlos Castillo, Francesco Bonchi, Rossano Schifanella, Filippo Menczer, Alessandro Flammini: The Role of Information Diffusion in the Evolution of Social Networks. In KDD 2013.
Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco: Who to follow and why: link prediction with explanations. In KDD 2014.
Seth A. Myers and Jure Leskovec: The bursty dynamics of the Twitter information network. In WWW 2014.
2
Extension: links and traces, with data from Y! Meme (2009-2010)
Lilian Weng, Jacob Ratkiewicz, Nicola Perra, Bruno Gonçalves, Carlos Castillo, Francesco Bonchi, Rossano Schifanella, Filippo Menczer, Alessandro Flammini: The Role of Information Diffusion in the Evolution of Social Networks. In KDD 2013
3
Y! Meme Dataset
~128k users, ~3.5M links, ~7M posts
Entire history available!
4
Most slides in this section are from a talk by F. Menczer.
5
Triadic closure with grandparent
Triadic closure with original poster
Triadic closure with someone else
7
Many traffic shortcuts … can this happen by chance?
Let's take grandparent links (G) as an example.
Links are labeled in creation order.
The probability of a link being a “grandparent” link by chance depends on NG, the number of grandparents of the link at creation time, and on φ(.), the set of people followed by the creator of the link at link creation time.
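As a rough sketch of the chance probability described above, assume a null model in which the creator of link ℓ picks a target uniformly at random among the users not yet followed. The symbol N (total number of users) and the exact form of the denominator are assumptions made for illustration; they are not taken from the slide:

```latex
p_G(\ell) = \frac{N_G(\ell)}{N - 1 - |\varphi(\ell)|}
```

Under this assumption, a link with no grandparents available at creation time has zero probability of being a grandparent link by chance.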
8
Expected and actual number of links
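If the process is random, each link is a grandparent link only with its chance probability, so the expected number of grandparent links and its variance follow from a sum of independent Bernoulli trials.
The actual number of grandparent links is SG.
By the central limit theorem, the following should be normally distributed with mean 0 and variance 1: $z = (S_G - E[S_G]) / \sqrt{\mathrm{Var}[S_G]}$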
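The standardization above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-link chance probabilities have already been computed and that links are independent; the function name and the toy numbers are made up.

```python
import math

def grandparent_z_score(chance_probs, actual_count):
    """Standardize the observed number of grandparent links.

    chance_probs: per-link probability of being a grandparent link by chance
                  (assumed precomputed, one entry per link).
    actual_count: observed number of grandparent links (S_G).
    """
    # Sum of independent Bernoulli trials: mean and variance add up.
    expected = sum(chance_probs)
    variance = sum(p * (1.0 - p) for p in chance_probs)
    if variance == 0:
        raise ValueError("variance is zero; z is undefined")
    return (actual_count - expected) / math.sqrt(variance)

# Toy data (made up): 100 links, each with a 5% chance, 30 observed.
z = grandparent_z_score([0.05] * 100, 30)
print(round(z, 2))
```

A value of z far from 0 suggests that the observed count is hard to explain by random linking alone.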
9
z computed for different degrees
Notice that z is in general a large number, very unlikely to come from a normal distribution with mean 0 and variance 1. Observe also that at about 75 links, triadic closure starts taking a secondary role.
10
Effect of repeated exposure
The more posts we see from someone, the more likely we are to follow her/him next.
This is consistent with previous observations regarding repeated exposure to online content.
11
Link efficiency (Usage of link after creation)
Average number of posts seen through that link per unit of time after the link is created.
Each box shows data within the lower and upper quartiles. Whiskers represent the 99th percentile. The triangle and line in a box represent the median and mean, respectively. Note that the mean can fall outside the shown quantiles for skewed distributions. The gray area and the black line across the entire figure mark the interquartile range and the median of the measure across all links, respectively.
12
Link creations might happen by a mixture of reasons
How do we determine the relative weights of each element in the mixture? E.g., grandparent vs. random.
We use the likelihood of a link being created given that the strategy is Ψ and the state of the graph is Θ.
Combined strategy: link to a grandparent with probability p, and to a random user with probability (1-p).
13
Log likelihood
For numerical stability, the log likelihood is computed instead of the likelihood itself.
For instance, for grandparent links it is the sum, over all links, of the log of each link's likelihood under the grandparent strategy.
To plot the log likelihood as a function of p, we evaluate it numerically over a grid of values of p.
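The grid evaluation described above can be sketched as follows. This is a minimal illustration, assuming that the likelihood of each observed link under the pure grandparent strategy and under the pure random strategy is already available; the function names and the toy values are hypothetical.

```python
import math

def mixture_log_likelihood(p, links):
    """Log likelihood of the observed links under the combined strategy.

    links: list of (f_grandparent, f_random) pairs, the likelihood of each
           observed link under each pure strategy (assumed precomputed).
    """
    total = 0.0
    for f_g, f_r in links:
        mix = p * f_g + (1.0 - p) * f_r
        if mix <= 0.0:
            return float("-inf")  # this link is impossible under this p
        total += math.log(mix)
    return total

def best_p(links, steps=101):
    """Exhaustively try evenly spaced values of p in [0, 1]."""
    grid = [i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda p: mixture_log_likelihood(p, links))

# Toy data (made up): two links point to grandparents, two do not.
toy_links = [(0.2, 0.01), (0.0, 0.01), (0.25, 0.01), (0.0, 0.01)]
p_hat = best_p(toy_links)
```

Working with sums of logs avoids the underflow that a product of millions of small probabilities would cause.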
14
Link creations happen through a mixture of reasons
15
Assume all users create links using a mix of triadic closure (Δ) and traffic shortcuts (G+O); the parameters shown are the maximum likelihood estimates.
16
Per-user assignments
Assume you have p = <ptraffic, pstructure, prandom> (traffic=G+O, structure=Δ)
Assume you have p' = <p'traffic, p'structure, p'random>
Could you determine if a particular user is more likely to be using strategy p than p'?
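One way to answer the question above is to compare the log likelihood of a user's links under each strategy. The sketch below assumes that, for every link the user created, its likelihood under each pure strategy (traffic, structure, random) has been precomputed; all names and numbers are illustrative.

```python
import math

def user_log_likelihood(weights, link_likelihoods):
    """Log likelihood of one user's links under a mixed strategy.

    weights: (p_traffic, p_structure, p_random), summing to 1.
    link_likelihoods: one (f_traffic, f_structure, f_random) triple per link
                      (assumed precomputed).
    """
    total = 0.0
    for f in link_likelihoods:
        mix = sum(w * x for w, x in zip(weights, f))
        if mix <= 0.0:
            return float("-inf")
        total += math.log(mix)
    return total

def more_likely_strategy(p, p_prime, link_likelihoods):
    """Return the label of the better-fitting strategy and the log-likelihood gap."""
    ll_p = user_log_likelihood(p, link_likelihoods)
    ll_q = user_log_likelihood(p_prime, link_likelihoods)
    return ("p" if ll_p >= ll_q else "p'"), ll_p - ll_q

# Toy user (made up) whose links mostly follow traffic shortcuts.
toy_user = [(0.3, 0.0, 0.001), (0.2, 0.0, 0.001), (0.0, 0.1, 0.001)]
label, gap = more_likely_strategy((0.7, 0.2, 0.1), (0.1, 0.2, 0.7), toy_user)
```

The gap is a log likelihood ratio; larger positive values favor p more strongly.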
17
Behavioral clusters
Expectation-Maximization algorithm for clustering:
Assume there are k centroids
Each centroid is a vector of three probabilities: <ptraffic, pstructure, prandom> (traffic=G+O, structure=Δ)
Repeat:
  Assign each person to the most likely centroid
  Recompute centroids
The number k can be determined by cross-validation
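The loop above can be sketched as a hard-assignment variant of EM. This is only an illustration: the centroid update here uses a coarse grid search over probability triples, which is a simplification chosen for brevity and not necessarily what the authors did, and the per-link likelihoods are assumed to be precomputed.

```python
import math

def log_lik(weights, links):
    """Log likelihood of one user's links under a mixed strategy."""
    total = 0.0
    for f in links:
        mix = sum(w * x for w, x in zip(weights, f))
        if mix <= 0.0:
            return float("-inf")
        total += math.log(mix)
    return total

def simplex_grid(n=10):
    """All triples (a/n, b/n, c/n) of non-negative weights summing to 1."""
    return [(a / n, b / n, (n - a - b) / n)
            for a in range(n + 1) for b in range(n + 1 - a)]

def hard_em(users, centroids, iterations=10):
    """users: dict user -> list of (f_traffic, f_structure, f_random)."""
    centroids = list(centroids)
    grid = simplex_grid()
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each user goes to the most likely centroid.
        new_assignment = {
            u: max(range(len(centroids)),
                   key=lambda c: log_lik(centroids[c], links))
            for u, links in users.items()
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: pick the grid point that best explains each cluster.
        for c in range(len(centroids)):
            members = [users[u] for u, a in assignment.items() if a == c]
            if members:
                centroids[c] = max(
                    grid, key=lambda w: sum(log_lik(w, m) for m in members))
    return assignment, centroids

# Toy data (made up): two traffic-driven users and one random linker.
toy_users = {
    "a": [(0.3, 0.0, 0.001)] * 3,
    "b": [(0.25, 0.0, 0.001)] * 2,
    "c": [(0.0, 0.0, 0.001)] * 3,
}
assignment, centroids = hard_em(toy_users, [(0.6, 0.2, 0.2), (0.2, 0.2, 0.6)])
```

In this toy run the two traffic-driven users end up in one cluster and the random linker in the other.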
18
Clusters found
Legend: >0.50, >0.05
Info: preference for traffic-related linking
Cfrd: preference for friends but also random
Random: preference for random
19
Clusters of users by most likely link creation strategy
Clusters: Random, Mixture, Information oriented, Casual friendship, Friendship
21
In general ...
In general, active, popular, influential users make an information network more efficient by creating “shortcuts” for information diffusion.
22
Link prediction with explanations
23
Link prediction with explanations
Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco Who to follow and why: link prediction with explanations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14)
24
Key idea
Common identity and common bond theory:
Identity-based attachment holds when people join a community based on their interest in a well-defined common topic.
Bond-based attachment is driven by personal social relations with other specific individuals.
25
Example network
Sets next to nodes indicate interests (products purchased, hashtags used, keywords, etc.)
Which links are identity-based and which ones are bond-based?
26
Example network (3 communities)
Blue links are bond-based.
Bond-based communities tend to have high density and reciprocal links.
Green and orange links are identity-based.
Identity-based communities tend to exhibit a clear directionality (leaders and followers).
27
Factors governing link behavior
Authority (influence held)
Susceptibility (influence received)
Social attitude (propensity to reciprocate links)
Features adopted
28
Factors governing link behavior
29
Inference of generative model
Gray circles are observed variables: (u,v) links and (u,f) features.
Dirichlet distributions are used for sparsity.
30
Parameters
31
Once a model has been fitted ...
Recommend social link (u,v) if both are members of the same social community.
Recommend topical link (u,v) if u is interested in the topic and v is an authority in the topic.
32
Experimental results: baselines
JSVD (joint SVD): low-rank approximation X' ≈ X = [E F]
  E contains (user, user) pairs for links
  F contains (user, feature) pairs for interests
CNF (common neighbors and features)
Adamic/Adar
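For readers unfamiliar with the neighborhood-based baselines, here is a minimal sketch of common neighbors and the Adamic/Adar score. It treats the graph as undirected for simplicity, which is an assumption; follower graphs such as Twitter's and Flickr's are directed, and the feature part of CNF is not covered here.

```python
import math

def common_neighbors(adj, u, v):
    """Nodes adjacent to both u and v."""
    return adj[u] & adj[v]

def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over common neighbors.

    A common neighbor of two distinct nodes has degree >= 2,
    so the logarithm is always positive.
    """
    return sum(1.0 / math.log(len(adj[w]))
               for w in common_neighbors(adj, u, v))

# Toy undirected graph (made up), stored as adjacency sets.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
cn = len(common_neighbors(adj, "a", "b"))
aa = adamic_adar(adj, "a", "b")
```

Adamic/Adar gives more weight to common neighbors that have few connections, on the intuition that sharing a low-degree neighbor is more informative than sharing a hub.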
33
Results (two datasets)
Twitter (n=81K, m=1.7M)
Flickr (n=80K, m=14M)
34
Temporal dimension
35
Temporal dimension
“Bursty” dynamic in which cascades of information create new links
Communities become more dense
Communities become more topically cohesive
Myers & Leskovec: The bursty dynamics of the Twitter information network. WWW 2014
36
Followers gained, followers lost, number of retweets, and number of tweets all scale with the indegree (number of followers) of a user. E.g., in a given month a user with indegree 100 tends to gain 10 and lose 3 followers.
37
For users with 1000-2000 followers
Got retweeted? => will get followers
Tweet too little or too much => lose followers
38
Three case studies
266K followers: burst in retweets => burst in followers
218K followers: burst in retweets => nothing
112K followers: no tweets, but followed/unfollowed
39
Observations of ego-networks
Users with 2K or more followers
Instances in which a retweet burst was followed by a follow burst (or unfollow burst)
Measure properties of their ego networks:
  Tweet similarity: average content similarity of (v,u) over all v such that follows(v,u)
  Tweet coherence: average content similarity over all (v,w) pairs such that follows(v,u) and follows(w,u)
  Connected components
  Edge density
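The ego-network measures listed above can be sketched as follows. The content similarity function is an assumption: cosine similarity over term-count vectors is used here for illustration, and the slides do not state which similarity the authors used. Edge density is computed for a directed graph without self-loops.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tweet_similarity(u, followers, vectors):
    """Average similarity between each follower v and the ego user u."""
    return sum(cosine(vectors[v], vectors[u]) for v in followers) / len(followers)

def tweet_coherence(followers, vectors):
    """Average similarity over all pairs of followers of the ego user."""
    pairs = list(combinations(followers, 2))
    return sum(cosine(vectors[v], vectors[w]) for v, w in pairs) / len(pairs)

def edge_density(nodes, edges):
    """Fraction of possible directed edges that are present."""
    n = len(nodes)
    return len(edges) / (n * (n - 1))

# Toy ego network (made up): u is followed by v1 and v2.
vectors = {"u": {"a": 1}, "v1": {"a": 1}, "v2": {"b": 1}}
followers = ["v1", "v2"]
sim = tweet_similarity("u", followers, vectors)
coh = tweet_coherence(followers, vectors)
dens = edge_density(["u", "v1", "v2"], {("v1", "u"), ("v2", "u")})
```

Comparing these values before and after a burst is what allows statements such as "similarity increases" to be checked.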
40
Results in ego-nets
Content similarity (to leader) and coherence (among followers) increase.
Weakly connected components increase; edge density increases slightly.
41
What causes a large follower burst?
The perfect scenario for a large follower burst is that a person's content reaches a large, new, and compatible audience
42
What does a compatible audience mean? What is the scale?
Plot axes: # Followers, Probability
Different users have different characteristic scales.
sim(v,u) is lognormal-distributed for follows(v,u).
43
Normalized similarity of v as a potential follower of u
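One natural way to normalize a lognormally distributed similarity is to standardize its logarithm using the mean and standard deviation observed among the current followers of u. The sketch below illustrates that idea only; it is an assumption and may differ from the exact definition used by Myers & Leskovec, and the function name and toy values are made up.

```python
import math
import statistics

def normalized_similarity(sim_vu, follower_sims):
    """Standardized log-similarity of a candidate follower v of u.

    sim_vu: similarity between v and u (must be > 0).
    follower_sims: similarities between u and each current follower
                   (all > 0, at least two values).
    """
    logs = [math.log(s) for s in follower_sims]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)  # sample standard deviation
    return (math.log(sim_vu) - mu) / sigma

# Toy data (made up): the log-similarities are symmetric around log(0.2).
toy_sims = [0.1, 0.2, 0.4]
y_typical = normalized_similarity(0.2, toy_sims)
y_high = normalized_similarity(0.4, toy_sims)
```

Normalizing per user puts candidates for different users on a common scale, which matters because different users have different characteristic scales.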
44
Y_uv helps predict followers
Plot: probability that v follows u
The probability increases exponentially with normalized similarity.
45
Building a predictor
Basic building block: estimates the probability of following given similarity
Set of nodes who have seen content by u
Set of nodes that are followers of followers of u
46
Experimental results
Task: given a retweet burst (cascade), predict whether there will be a follower burst.

Method                        AUC
Myers & Leskovec 2014         0.52
Number of retweet exposures   0.38
Number of retweets            0.33
Number of followers           0.22
Random                        0.21
47
Conclusions
Link formation is a complex process:
Driven by triadic closure
Driven by shortcut formation
Driven by topical communities
Tends to occur in bursts