Information Diffusion in Social Media. Kristina Lerman, University of Southern California. CS 599: Social Media Analysis. Access to data allows us to ask new questions and empirically measure effects.
Information diffusion on Twitter follower graph
Diffusion on networks The spread of disease, ideas, behaviors, … on a network can be described as a contagion process where an active node (infected/informed/adopted) activates its non-active neighbors with some probability … creates a cascade on a network How large do cascades become? What determines their growth?
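The contagion process described above can be sketched in a few lines. This is a minimal illustration, not a model taken from the slides: the toy network, the seed node and the activation probability of 0.3 are all hypothetical, and each active node is assumed to get exactly one chance to activate each inactive neighbor.

```python
import random
from collections import deque

def simulate_cascade(neighbors, seed, prob, rng):
    """Run one contagion cascade and return the set of activated nodes.

    neighbors: dict mapping node -> list of nodes it can activate
    seed:      the initially active node
    prob:      chance that an active node activates each inactive neighbor
    """
    active = {seed}
    frontier = deque([seed])
    while frontier:
        node = frontier.popleft()
        # A newly active node gets one activation attempt per inactive neighbor.
        for nb in neighbors.get(node, []):
            if nb not in active and rng.random() < prob:
                active.add(nb)
                frontier.append(nb)
    return active

# Hypothetical toy network: a ring of 6 nodes plus one chord (0-3).
toy = {0: [1, 5, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3, 5], 5: [4, 0]}

rng = random.Random(0)
sizes = [len(simulate_cascade(toy, 0, 0.3, rng)) for _ in range(1000)]
print("mean cascade size:", sum(sizes) / len(sizes))
```

Repeating the run for several values of `prob` gives a rough picture of how cascade size grows with the activation probability.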
Diffusion models. Complex response: infection requires multiple exposures. Non-monotonic exposure response. [Figure: exposure response functions for the threshold model and for complex contagion; axes: number of infected neighbors vs. infection probability (0 to 1).]
Epidemic diffusion model. Infected nodes propagate the contagion to susceptible neighbors with probability m (the transmissibility, or virality, of the contagion). A popular metaphor used to study information spread. [Figure: exposure response function; axes: number of infected neighbors vs. infection probability (0 to 1); node states shown: infected, exposed.]
Epidemic threshold. Epidemic threshold t: for m < t, cascades stay localized (the epidemic dies out); for m > t, global cascades occur. The epidemic threshold depends on topology only: it is set by the largest eigenvalue of the adjacency matrix of the network (t is the reciprocal of that eigenvalue). True for any network. [Figure: cascade size N vs. transmissibility m, with the epidemic threshold marked.]
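As a rough illustration of how the threshold follows from topology, the sketch below estimates the largest adjacency eigenvalue by power iteration and takes its reciprocal. It assumes the standard result that the threshold equals one over the largest eigenvalue. The two example graphs are hypothetical, and the code assumes a connected, undirected graph.

```python
import math

def largest_eigenvalue(adj, iterations=200):
    """Estimate the largest adjacency eigenvalue by power iteration.

    adj: dict mapping node -> list of neighbors (undirected, connected).
    Iterating with A + I instead of A avoids oscillation on bipartite
    graphs; the shift of 1 is subtracted from the result.
    """
    x = {v: 1.0 for v in adj}
    lam = 0.0
    for _ in range(iterations):
        y = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}  # (A + I) x
        # Rayleigh quotient of the current vector.
        lam = sum(x[v] * y[v] for v in adj) / sum(x[v] * x[v] for v in adj)
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    return lam - 1.0

def epidemic_threshold(adj):
    """Threshold transmissibility, assumed to be 1 / (largest eigenvalue)."""
    return 1.0 / largest_eigenvalue(adj)

# Hypothetical examples: a star with 4 leaves and a complete graph on 5 nodes.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
complete = {v: [u for u in range(5) if u != v] for v in range(5)}

print("star threshold:", epidemic_threshold(star))
print("complete graph threshold:", epidemic_threshold(complete))
```

The denser graph has the larger eigenvalue and therefore the lower threshold, which matches the intuition that contagion takes off more easily on well-connected networks.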
Differences in the Mechanics of Information Diffusion across Topics: Idioms, Political Hashtags and Complex Contagion on Twitter. Daniel M Romero, Brendan Meeder and Jon Kleinberg. Presentation by Aswin Rajkumar.
Motivation and Contribution Information Diffusion and Topics - Eg: Controversial political topics have high information diffusion. - Scientific study of the variation in diffusion mechanics across topics. Contribution of the paper - Empirical analysis of real world data - Observation that the mechanics of spread can be defined using two variables, stickiness and persistence. - Confirmation of sociological theories found in the offline world – diffusion of innovations
The Study – How? Twitter – Dataset: a snapshot covering a large number of tweets over a period of several months (Aug 09 to Jan 10); 3 billion messages from over 60 million users. #Hashtag – Tokens: the top 500 hashtags. @Mention – Network: Y is in X's neighbor set if there are at least t @-mentions from X to Y, with t = 3. Why? It shows X’s attention to Y.
The Study – What? Adoption and Spread of Hashtags - Diffusion Topics – Politics, Celebrity, Music, Movies, Games, Idioms, Sports and Technology Stickiness - the probability that a piece of information will pass from a person who knows or mentions it to another person who is exposed to it. Persistence and “Complex Contagion”, a principle from sociology. Persistence - the relative extent to which repeated exposures to a hashtag continue to have significant marginal effects on adoption. Rate of decay.
Complex Contagion Complex contagion refers to the phenomenon in social networks in which multiple sources of exposure to an innovation are required before an individual adopts the change of behavior. - Wikipedia
[Figure: exposure curve P(K), annotated with stickiness and persistence.]
Analysis – Stickiness and Persistence. Take the top 500 hashtags. Classify them into 8 topics or categories. Construct p(k) curves for each hashtag and average them separately within each category. Compare the shapes. Political hashtags: high stickiness and persistence. Twitter idioms: high stickiness, low persistence. Example hashtags, grouped by category: #mw2, #mafiawars; #lost, #newmoon; #mj, #brazilwantsjb; #pandora, #thisiswar; #obama, #hcr; #cricket, #nhl; #photoshop, #digg.
Twitter Idioms #cantlivewithout #musicmonday #iloveitwhen #followfriday
Analysis – Subgraph Structure Interconnections among early adopters Subgraphs for political hashtags - High in-degree, large number of triangles. Tie Strength – Strong, Weak. Credit : Bridge-talent.com
Exposure Curve - Definitions. k-exposed: a user is k-exposed to a hashtag h if the user has not used h but is connected to k other users who have used h in the past. What is the probability that a k-exposed user u will use hashtag h in the future?
1) Ordinal time estimate: the probability that a k-exposed user u uses hashtag h before becoming (k+1)-exposed. E(k) is the number of k-exposed users; I(k) is the number of k-exposed users who used h before becoming (k+1)-exposed; P(k) = I(k) / E(k).
2) Snapshot estimate: similar, but based on time. E(k) is the number of users who are k-exposed at time t1; I(k) is the number of users who are k-exposed at t1 and used h before time t2; P(k) = I(k) / E(k). The resulting P(k) is the exposure curve.
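The ordinal time estimate can be sketched as a single pass over the adoption sequence. This is an illustrative reading of the definitions above, not the authors' code. The input format (`adoptions`, `follows`) and the toy data are hypothetical, and adoptions at k = 0 are assumed not to count toward any P(k).

```python
from collections import defaultdict

def exposure_curve(adoptions, follows):
    """Ordinal time estimate of P(k) = I(k) / E(k) for one hashtag.

    adoptions: list of users in the order they first used the hashtag
    follows:   dict mapping user -> set of users they pay attention to
    """
    # Reverse index: who becomes exposed when a given user adopts?
    audience = defaultdict(set)
    for user, sources in follows.items():
        for src in sources:
            audience[src].add(user)

    exposure = defaultdict(int)  # current k of each user
    adopted = set()
    E = defaultdict(int)         # number of users who were ever k-exposed
    I = defaultdict(int)         # number of users who adopted while k-exposed

    for user in adoptions:
        if user in adopted:
            continue
        k = exposure[user]
        if k > 0:                # k = 0 adoptions are not attributed to exposure
            I[k] += 1
        adopted.add(user)
        for watcher in audience[user]:
            if watcher not in adopted:
                exposure[watcher] += 1
                E[exposure[watcher]] += 1

    return {k: I[k] / E[k] for k in sorted(E)}

# Hypothetical toy data: b, c, d and e all pay attention to a, and so on.
follows = {"b": {"a"}, "c": {"a", "b"}, "d": {"a", "b", "c"}, "e": {"a"}}
curve = exposure_curve(["a", "b", "c"], follows)
print(curve)  # {1: 0.25, 2: 0.5, 3: 0.0}
```

In the toy data, four users become 1-exposed and one of them adopts, two become 2-exposed and one adopts, and the single 3-exposed user never adopts.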
Comparison Parameters. Persistence parameter: F(P) = A(P) / R(P), where A(P) is the area under the curve P and R(P) is the area of the rectangle of length K and height max(P(k)). Curve comparisons: increases rapidly and falls vs. increases slowly and saturates; increases slowly and saturates vs. rapid increase. Stickiness parameter: M(P) = max(P(k)).
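A small sketch of the two parameters follows. It makes assumptions the slide does not state: the curve is sampled at k = 0, 1, ..., K with unit spacing, and adjacent points are joined by straight lines (trapezoid rule). The two example curves are invented for illustration.

```python
def stickiness(p):
    """M(P): the peak value of the exposure curve."""
    return max(p)

def persistence(p):
    """F(P) = A(P) / R(P).

    p: list of P(k) for k = 0, 1, ..., K (assumed unit spacing in k,
       with adjacent points joined by straight lines).
    """
    K = len(p) - 1
    area = sum((p[i] + p[i + 1]) / 2 for i in range(K))  # trapezoid rule
    rectangle = K * max(p)
    return area / rectangle

# Hypothetical curves with the same peak but different shapes.
peaked = [0.0, 0.04, 0.01, 0.0, 0.0]          # rises rapidly, then falls
saturating = [0.0, 0.04, 0.04, 0.04, 0.04]    # rises, then stays high

for name, curve in [("peaked", peaked), ("saturating", saturating)]:
    print(name, "stickiness:", stickiness(curve), "persistence:", persistence(curve))
```

Both curves have the same stickiness, but the curve that stays high after its peak fills more of the bounding rectangle and so has the higher persistence.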
Plots F(P) = A(P) / R(P) -> Persistence Parameter M(P) = Max(P(K)) -> Stickiness Parameter
Improvements and Related Work. The @mention network is not very representative; also, attention should be from Y to X. The paper considers only average persistence; median and variance should be analyzed too. Other types of networks, e.g., blogs [Gruhl, Guha, Liben-Nowell, Tomkins – Information Diffusion through Blogspace]. Influence on online behavior, e.g., games [Woo, Kang, Kim – The Contagion of Malicious Behaviors in Online Games]. Network structure is dynamic in real life [Baños, Borge-Holthoefer, Wang, Moreno, González-Bailón – Diffusion Dynamics with Changing Network Composition].
Conclusion. Hashtags of different topics exhibit different mechanics of spread. Politically controversial hashtags have the highest diffusion. Information diffusion depends on the probability of users adopting a hashtag after repeated exposure to it: on the magnitude of the probabilities as well as their rate of decay. Confirms the sociological theory of complex contagion. Higher in-degree and stronger ties result in better spread.
Questions?
What Stops Social Epidemics? (Ver Steeg et al.) Why do information cascades in social media grow quickly initially but remain much smaller than predicted by epidemic models? Information cascades differ from viral contagion: the response to repeated exposure is important on Digg (and Twitter), and it drastically alters predictions about the size of epidemics.
Social news: users submit or vote for (are infected by) news stories. Social network: users follow ‘friends’ to see stories friends submit and stories friends vote for. Trending stories: Digg promotes the most popular stories to its Top News page.
How large are cascades in social media? Cascade size: the number of people who share a message (with a URL). Digg: 3.5K URLs, 258K users, 1.7M edges. Twitter: 70K URLs, 700K users, 36M edges. Most cascades are less than 1% of the total network size! [Lerman et al. “Social Contagion: An Empirical Study of Information Spread on Digg and Twitter Follower Graphs” arXiv:1202.3162]
Why are these cascades so small? Standard model of epidemic growth (heterogeneous mean field theory, SIR model, same degree distribution as Digg). [Figure: cascade size vs. transmissibility m; most cascades fall within a narrow range.] Under the standard epidemic model, how should cascade sizes look as a function of lambda? First, there should be a threshold, as predicted by several models. Now look at the observed cascade sizes: what do they tell us about the transmissibility of our stories? The transmissibility of almost all Digg stories falls within the width of this line?!
Maybe graph structure is responsible? [Figure: cascade size vs. transmissibility m, with the epidemic threshold marked; curves: mean field prediction (same degree distribution), simulated cascades on a random graph with the same degree distribution, simulated cascades on the observed Digg graph.] Clustering reduces the epidemic threshold and cascade size, but not enough!
The first explanation might be graph structure: mean field theory neglects rich cluster structure and finite graph effects. In this graph, we changed back from a log-log scale to see these small differences. We simulated cascades on graphs to see how structure affects cascade size. A finite random graph with the same degree distribution slightly reduces cascade sizes, because loops and clustering are unavoidable on a finite graph. The Digg graph has even more structure, which reduces the threshold and yields smaller cascades. And yet this still doesn’t jibe with our observed cascade sizes.
kL: We also observe a transmissibility threshold, below which cascades die out, on both the random graph and the actual graph. Clustering lowers this threshold. Clustering also limits the size of cascades: at the same transmissibility, cascades on the random graph are bigger than on the actual Digg graph. The golden line shows the expected cascade size from heterogeneous mean field (HMF) theory, which predicts cascade size in the limit of large graphs, taking the long-tailed degree distribution into account. For the random graph, the epidemic threshold and the cascade sizes are very close to the HMF prediction. Because the randomized graph is still finite, some clustering inevitably occurs (it has a clustering coefficient of about 0.02), decreasing the cascade size from the HMF prediction.
What about the spreading mechanism? [Figure: network of infected and not-infected nodes, with one node marked “?”.] If not structure, maybe the spreading mechanism has some effect? It is a very simple mechanism, but not without choices: we have to decide what to do about repeat exposure. Is this a big effect?
Are repeat exposures a big effect? Yes, more than half of the users are exposed to the same information more than once! This graph shows the probability of having exactly n friends voting, on a log-log plot with a longish tail. There is a significant probability of having 10 friends voting. Clearly, repeat exposures are important. So how do people respond?
How do people respond to repeated exposure? Not much. [Figure: exposure response.] We have similar results for Twitter; also noted by Romero et al., WWW 2011. If we look at the probability of voting on a story given that n friends have voted on it, having 30 friends voting doesn’t make you significantly more likely to vote than having 1 friend voting. In the independent cascade model (ICM), by contrast, if the probability of voting after one friend votes is lambda, there is an independent probability lambda for each subsequent friend voting. We’ve also noticed a deviation from the ICM in our Twitter data, and a similar observation has been reported. But now the important question: what is the effect of this observation?
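To make the contrast concrete: if each of n voting friends gives an independent chance lambda of infection, the probability of voting after n exposures is 1 - (1 - lambda)^n. The sketch below evaluates this for a hypothetical lambda of 0.02, a value chosen only for illustration and not measured from Digg or Twitter.

```python
def icm_response(lam, n):
    """Probability of infection after n independent exposures,
    each succeeding with probability lam."""
    return 1 - (1 - lam) ** n

lam = 0.02  # hypothetical per-exposure probability
for n in (1, 2, 5, 10, 30):
    print(n, "voting friends ->", round(icm_response(lam, n), 3))
```

Under this assumption, 30 voting friends would make a user more than twenty times as likely to vote as a single voting friend, which is the kind of growth the observed exposure response does not show.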
Big consequences for cascade growth. Most people are exposed to a story more than once. Repeated exposures have little effect. Growth of epidemics is severely curtailed (especially compared to the Independent Cascade Model).
Weak response to repeated exposures suppresses outbreaks. Take the effect of repeat exposure into account. [Figure: cascade size vs. transmissibility m, showing actual Digg cascades and the result of simulations; the epidemic threshold (marked λ*, m*) is unchanged.] Back to our graph of cascade size as a function of transmissibility. Now we simulate cascades again, taking into account only the graph structure and the fact that repeated exposure doesn’t increase the probability of voting. This predicts the threshold and matches the cascade sizes perfectly, orders of magnitude smaller than viral epidemic predictions. That’s really nice: essentially a zero-parameter fit. But we can explain more than just cascade size.
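A toy simulation can illustrate the effect. This is only a sketch under simplifying assumptions, not the simulation behind the slide. The random graph, its size, and the transmissibility are hypothetical, and the weak response is modeled in its most extreme form, where only the first exposure gives a user a chance to be infected.

```python
import random

def random_graph(n, avg_degree, rng):
    """Hypothetical undirected random graph with roughly the given mean degree."""
    adj = {v: set() for v in range(n)}
    p = avg_degree / (n - 1)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def cascade_size(adj, seed, lam, rng, repeats_count):
    """Size of one cascade started at seed.

    repeats_count=True:  every exposure is an independent chance (ICM-like).
    repeats_count=False: only the first exposure can infect a node.
    """
    infected = {seed}
    exposed = {seed}
    frontier = [seed]
    while frontier:
        node = frontier.pop()
        for nb in adj[node]:
            if nb in infected:
                continue
            first_exposure = nb not in exposed
            exposed.add(nb)
            if (repeats_count or first_exposure) and rng.random() < lam:
                infected.add(nb)
                frontier.append(nb)
    return len(infected)

n = 300
rng = random.Random(42)
graph = random_graph(n, avg_degree=10, rng=rng)

icm_sizes = [cascade_size(graph, 0, 0.2, rng, True) for _ in range(100)]
single_sizes = [cascade_size(graph, 0, 0.2, rng, False) for _ in range(100)]
print("mean size, every exposure counts:", sum(icm_sizes) / len(icm_sizes))
print("mean size, first exposure only: ", sum(single_sizes) / len(single_sizes))
```

When only the first exposure counts, each user gets a single chance of infection no matter how many friends share the story, so the infected fraction of the exposed population cannot grow much beyond the transmissibility itself.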
How Limited Visibility and Divided Attention Constrain Social Contagion (Hodas & Lerman, 2012). Questions: How do people respond to exposures to information by friends on social media? What role does content play in information diffusion? Findings: Users have a finite ability to process information. The most recently received messages are retweeted; the rest are overlooked. Highly connected users (hubs) are far less likely to retweet any message they receive than poorly connected people. The reduced susceptibility of hubs to “infections” explains why cascades are small.
Mechanics of information diffusion. A user must see an item and find it interesting before he/she can spread it (e.g., by retweeting it, voting for it, or liking it). [Diagram: See? (cognitive, interface), Interesting? (tastes, content), Respond (retweet).]
Cognitive factors: Position bias People pay more attention to items at the top of the screen or a list of items [Payne, The Art of Asking Questions (1951) ] [Buscher et al, CHI’09] [Counts & Fisher ICWSM’11] … limits how far down the list/page the user navigates
Measuring position bias. Amazon Mechanical Turk experiments: users were asked to recommend science stories, and we controlled the order in which stories were presented to them. Position bias: stories at top list positions received more recommendations. Can we control user attention, through story ordering, so as to improve the outcomes of peer recommendation? [Lerman & Hogg (2014) “Leveraging position bias to improve peer recommendation” in PLoS ONE]
Position bias creates “limited attention”. A new post appears at the top of the user’s screen; a post near the top is most likely to be seen. [Figure: post visibility, the probability to view a post vs. its position.]
Position bias creates “limited attention”. Some time later, newer posts appear at the top, and the post is less likely to be seen. [Figure: probability to view a post vs. its position.]
Position bias and number of friends. Some time later, newer posts appear at the top, and the post is less likely to be seen. A post of the same age is even less visible to a highly connected user. [Figure: post position for a user with few friends vs. a user with many friends.]
Friends are a source of distraction. Users with more friends are more active. Users with more friends are distracted by more content. Limited attention makes hubs less susceptible to ‘infection’. [Figure axis: number of friends, nf.]
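The interaction between position bias and the number of friends can be sketched with a toy model. Everything quantitative here is a hypothetical assumption made for illustration: the posting rate per friend, the newest-first ordering, and the geometric decay of attention with list position are not values or functional forms taken from the slides.

```python
def post_position(age_hours, n_friends, posts_per_friend_per_hour=0.5):
    """Position of a post in the stream (1 = top), assuming newest-first
    ordering and a constant, hypothetical posting rate per friend."""
    newer_posts = age_hours * n_friends * posts_per_friend_per_hour
    return 1 + newer_posts

def view_probability(position, decay=0.9):
    """Hypothetical position bias: each step down the list retains a
    fraction `decay` of the attention given to the position above."""
    return decay ** (position - 1)

age = 2  # hours since the post arrived
for n_friends in (10, 100, 1000):
    pos = post_position(age, n_friends)
    print(n_friends, "friends: position", pos, "view probability", view_probability(pos))
```

Under these assumptions a post of the same age sits far lower in the stream of a highly connected user, so its chance of being seen, and hence retweeted, is much smaller.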
Users retweet most recent messages. Users retweet the newest messages (at the top of their screen). Hubs are much less likely to retweet an older message. [Figure: “time response function” for high connectivity and low connectivity users.]
Does content matter? [Figure: probability to tweet a message, in terms of visibility and “virality”; estimated virality.]
Do “viral” messages spread farther? “Viral” messages can reach many or few people. [Figure axis: ln(“virality”).]
How do people respond to multiple exposures? [Figure: exposure response vs. number of tweeting friends.] Is this evidence for complex contagion?
“Complex contagion”: an artifact of heterogeneity. Breaking down the exposure response by sub-population, separated according to the number of friends users follow, reveals a simple, monotonic response. [Figure: exposure response for low connectivity and high connectivity users.]
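How pooling can manufacture a non-monotonic curve is easy to show with invented numbers. In the sketch below, both groups respond monotonically to more exposures, but low connectivity users (more responsive) dominate at small k while high connectivity users (less responsive) dominate at large k. All counts and probabilities are hypothetical.

```python
def pooled_response(groups):
    """Pool per-group exposure responses into one population-level curve.

    groups: list of (counts, response) pairs, where counts[i] is the number
            of users in the group with i + 1 exposures and response[i] is
            their probability of responding.
    """
    n_levels = len(groups[0][0])
    pooled = []
    for i in range(n_levels):
        users = sum(counts[i] for counts, _ in groups)
        responders = sum(counts[i] * response[i] for counts, response in groups)
        pooled.append(responders / users)
    return pooled

# Hypothetical data for exposures k = 1, 2, 3, 4.
low_counts, low_response = [1000, 300, 50, 5], [0.02, 0.04, 0.05, 0.06]
high_counts, high_response = [100, 200, 300, 400], [0.002, 0.004, 0.005, 0.006]

pooled = pooled_response([(low_counts, low_response), (high_counts, high_response)])
for k, p in enumerate(pooled, start=1):
    print("k =", k, "pooled response:", round(p, 4))
```

The pooled curve rises from k = 1 to k = 2 and then falls, even though neither group's response ever decreases; the shape comes from the changing mix of users at each exposure level.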
Summary. “A meme is not a virus”: information spread ≠ disease spread. Big consequences for modeling information spread in social media. Highly connected people (hubs) act as firewalls to information spread: they have a hard time finding messages in their stream. People have a finite capacity to process information; the more messages they receive, the less likely they are to respond to any given one. Information overload actually reduces the size of information cascades.