Published by Lindsay Thornton. Modified over 9 years ago.
1
Information Diffusion
Mary McGlohon
CMU 10-802, 3/23/10
2
Outline
Intro: Models for diffusion
▫ Epidemiological: SIS/SIR/SIRS
▫ Threshold models
Case studies
▫ SIR: Info diffusion in blogs
▫ SIS: Cascades in blogs
▫ Timing: Cascades in chain letters
▫ A closer look: Network-based Marketing
3
Epidemiological: SIS
Susceptible, Infected, Susceptible
▫ Infected for $t_I$ timesteps
▫ While infected, transmits with some probability
▫ After $t_I$ steps, returns to susceptible
4
Epidemiological: SIR
Susceptible, Infected, Removed
▫ Infected for $t_I$ timesteps
▫ While infected, transmits with some probability
▫ After $t_I$ steps, goes to removed/recovered
5
Epidemiological: SIRS
Susceptible, Infected, Removed, Susceptible
▫ Combination of SIS+SIR
▫ After $t_I$ steps, goes to removed/recovered
▫ After $t_R$ steps, returns to susceptible
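The three epidemiological models differ only in what happens after the infected period, so they can be sketched with one simulation. The following is a minimal sketch, not a reference implementation: the name `beta` for the transmission probability, the synchronous per-step update, and the convention that `t_removed=0` means SIS and `t_removed=None` means SIR are all assumptions made for illustration.

```python
import random

def simulate_sirs(neighbors, seeds, beta, t_infected, t_removed, steps, rng):
    """Discrete-time epidemic on a graph given as {node: [neighbor, ...]}.

    t_removed = 0    -> SIS (infected nodes go straight back to susceptible)
    t_removed = None -> SIR (removed nodes never return)
    otherwise        -> SIRS (removed nodes return after t_removed steps)
    Returns the number of infected nodes after each step.
    """
    state = {v: "S" for v in neighbors}
    timer = {v: 0 for v in neighbors}  # steps left in the current I or R state
    for v in seeds:
        state[v], timer[v] = "I", t_infected
    history = []
    for _ in range(steps):
        # Transmission: each infected node tries each susceptible neighbor once.
        newly_infected = set()
        for v in neighbors:
            if state[v] == "I":
                for w in neighbors[v]:
                    if state[w] == "S" and rng.random() < beta:
                        newly_infected.add(w)
        # Progression: count down the timers of nodes already in I or R.
        for v in neighbors:
            if state[v] == "I":
                timer[v] -= 1
                if timer[v] == 0:
                    if t_removed == 0:
                        state[v] = "S"
                    else:
                        state[v], timer[v] = "R", t_removed
            elif state[v] == "R" and t_removed is not None:
                timer[v] -= 1
                if timer[v] == 0:
                    state[v] = "S"
        for w in newly_infected:
            state[w], timer[w] = "I", t_infected
        history.append(sum(1 for s in state.values() if s == "I"))
    return history

# Hypothetical example: a ring of 10 nodes, one seed, certain transmission.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
print(simulate_sirs(ring, [0], 1.0, 1, None, 7, random.Random(0)))  # SIR
print(simulate_sirs(ring, [0], 1.0, 1, 0, 6, random.Random(0)))     # SIS
```

On this ring the SIR run burns out once the two infection fronts meet, while the SIS run keeps circulating because recovered nodes become susceptible again.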
6
Epidemiological: Networks
Historically, SIS/SIR assumed that a person could infect anybody else (a full clique).
There is an epidemic threshold in SIS.
For random power-law networks, the threshold is 0 [Pastor-Satorras and Vespignani].
▫ (But not for power-law networks with high clustering coefficients [Eguíluz and Klemm])
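One way to see why the threshold vanishes: in the standard mean-field treatment of SIS on uncorrelated random networks, the threshold is $\langle k \rangle / \langle k^2 \rangle$, and for a power-law degree distribution with exponent between 2 and 3 the second moment grows without bound as the maximum degree grows. The sketch below evaluates that ratio numerically. The exponent 2.5, the degree cutoffs, and the function name are illustrative assumptions, not values from the slides.

```python
def sis_threshold(gamma, k_min, k_max):
    """Mean-field SIS threshold <k>/<k^2> for P(k) proportional to k**(-gamma)."""
    degrees = range(k_min, k_max + 1)
    weights = [k ** (-gamma) for k in degrees]
    mean_k = sum(k * w for k, w in zip(degrees, weights))
    mean_k2 = sum(k * k * w for k, w in zip(degrees, weights))
    return mean_k / mean_k2  # the normalization of P(k) cancels in the ratio

# The threshold keeps shrinking as the largest allowed degree grows.
for k_max in (10**2, 10**3, 10**4, 10**5):
    print(k_max, sis_threshold(gamma=2.5, k_min=1, k_max=k_max))
```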
7
Threshold Models
Each node in the network has a threshold on the weighted influence of its neighbors.
If the weight of its adopted neighbors reaches the threshold, the node adopts.
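The adoption rule can be sketched as a linear threshold process that repeats until no further node adopts. This is a minimal sketch: the data layout (`in_weights`, `thresholds`) and the example graph are hypothetical and chosen only for illustration.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """in_weights[v] maps each neighbor u of v to the weight of u's influence on v."""
    adopted = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, weights in in_weights.items():
            if v in adopted:
                continue
            # Total weight of the neighbors of v that have already adopted.
            influence = sum(w for u, w in weights.items() if u in adopted)
            if influence >= thresholds[v]:
                adopted.add(v)
                changed = True
    return adopted

# Hypothetical four-node example.
in_weights = {
    "a": {},
    "b": {"a": 0.6, "c": 0.4},
    "c": {"b": 0.5, "d": 0.5},
    "d": {"c": 1.0},
}
low = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
high = dict(low, b=0.7)
print(sorted(linear_threshold(in_weights, low, {"a"})))   # the cascade reaches every node
print(sorted(linear_threshold(in_weights, high, {"a"})))  # the cascade stalls at the seed
```

Raising a single node's threshold from 0.5 to 0.7 is enough to block the whole cascade here, which is the sensitivity that threshold models are meant to capture.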
9
Info Diffusion in Blogs
D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins. Information Diffusion Through Blogspace. In WWW '04 (2004).
Goal: How do topics trend in blogs, and how can we model diffusion of topics?
10
Info Diffusion in Blogs
Data: Crawled 11K blogs, 400K posts. Found 340 topics:
▫ apple arianna ashcroft astronaut blair boykin bustamante chibi china davis diana farfarello guantanamo harvard kazaa longhorn schwarzenegger udell siegfried wildfires zidane gizmodo microsoft saddam
11
Info Diffusion in Blogs
Topics = Chatter + Spikes
▫ Chatter: Alzheimer
▫ Spike: Chibi
▫ Spiky Chatter: Microsoft
12
Info Diffusion in Blogs
Modeled as SIR
▫ Some set of authors is infected to write about a topic
▫ The topic then propagates as others write new posts on it
▫ Measure the topic over time and other properties
Fit using EM
▫ Compute probability of propagation along each edge
13
Info Diffusion in Blogs
Validation:
▫ Synthetic: used a modified Erdős-Rényi graph and created propagation on it; found that EM was able to identify transmission along most edges
▫ Real: found “internet-only” topics; looked at the most highly ranked expected transmission links and identified a real link in 90% of cases
14
Info Diffusion in Blogs
Limitations of SIR
▫ No multiple postings
▫ No “stickiness” (which topics resonate with whom)
▫ No time-limiting factor in topics
▫ “Closed world assumption”: no outside influences after initial infection
16
Cascades in Blogs
Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs: Patterns and a Model. In SIAM International Conference on Data Mining (SDM '07) (2007).
Goal: What do cascades (conversation trees) in blogs look like, and how can we model them?
17
Cascades in Blogs
Data:
▫ Gathered from August-September 2005
▫ Used set of 44,362 blogs, 2.4 million posts
▫ 245,404 blog-to-blog links
[Figure: number of posts per day over time; x-axis ticks at Jul 4, Aug 1, Sep 29]
18
Cascades in Blogs
[Figure: a blogosphere of blogs B1 to B4 and the cascades formed by their posts a to e, with example “Star” and “Chain” shapes]
What is the timing of links?
What are cascade sizes?
What are cascade shapes?
19
Cascades in Blogs
What is the timing of links? Does popularity decay at a constant rate? With an exponential (“half life”)?
[Figure: three panels on linear-linear, log-linear, and log-log scales]
20
Cascades in Blogs
Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is:
$p(t_p + \Delta) \propto \Delta^{-1.5}$
[Figure: log(# in-links) vs. log(days after post), slope = -1.5; panel label: linear-linear scale]
21
Cascades in Blogs
How are cascade sizes distributed? Geometric distribution?
[Figure: three panels on linear-linear, log-linear, and log-log scales, with an example cascade of posts a to e]
22
Cascades in Blogs
Q: What size distribution do cascades follow? Are large cascades frequent?
Observation: The probability of observing a cascade of $n$ blog posts follows a Zipf distribution:
$p(n) \propto n^{-2}$
[Figure: log(Count) vs. log(cascade size, # of nodes), slope = -2]
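A slope like this is typically read off a log-log plot. As a rough illustration, the sketch below fits the slope of log(count) against log(size) by least squares. The counts are hypothetical and generated to follow $n^{-2}$ exactly; they are not the blog data. Least squares on log-log axes is only a quick estimate, and maximum-likelihood estimators are generally preferred for real data.

```python
import math

def loglog_slope(sizes, counts):
    """Least-squares slope of log(count) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical counts that follow count = 10000 * n**-2 exactly.
sizes = [1, 2, 4, 8, 16, 32]
counts = [10000 * n ** -2 for n in sizes]
print(round(loglog_slope(sizes, counts), 3))
```

The same fit applies to the link-timing curve, with days after the post on the x-axis and the number of in-links on the y-axis.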
23
Cascades in Blogs
How are cascade shapes distributed? More stars? More chains?
[Figure: example cascade of posts a to e]
24
Cascades in Blogs
Q: What is the distribution of particular cascade shapes?
Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5).
[Figure: log(Count) vs. log(size of chain, # nodes), a = -8.5; log(Count) vs. log(size of star, # nodes), a = -3.1]
25
Cascades in Blogs
Based on SIS model in epidemiology
▫ Randomly pick a blog to infect, add its post to the cascade
▫ Infect each in-linked neighbor with a given infection probability
▫ Add the infected neighbors’ posts to the cascade
▫ Set the old infected node to uninfected
[Figure: example blog network with blogs B1, B2, B3, B4]
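The four steps above can be sketched as a short generator. This is a sketch under stated assumptions rather than the authors' code: the name `beta` for the infection probability, the `in_links` layout, and the `max_posts` safety cap are all hypothetical.

```python
import random

def generate_cascade(in_links, beta, rng, max_posts=10_000):
    """Return (posts, edges); posts are (blog, index) pairs, edges are (child, parent).

    in_links[b] lists the blogs that link to b, i.e. the blogs that b can infect.
    """
    start = rng.choice(sorted(in_links))  # randomly pick a blog to infect
    posts = [(start, 0)]
    edges = []
    frontier = [posts[0]]                 # posts of the currently infected blogs
    while frontier and len(posts) < max_posts:
        next_frontier = []
        for parent in frontier:
            for reader in in_links[parent[0]]:
                if rng.random() < beta:   # infect each in-linked neighbor
                    child = (reader, len(posts))
                    posts.append(child)   # the neighbor's post joins the cascade
                    edges.append((child, parent))
                    next_frontier.append(child)
        # Old infected blogs return to uninfected and can be infected again later.
        frontier = next_frontier
    return posts, edges

# Hypothetical four-blog network.
in_links = {"B1": ["B2", "B4"], "B2": ["B1"], "B3": ["B1"], "B4": ["B1", "B3"]}
posts, edges = generate_cascade(in_links, beta=0.3, rng=random.Random(1))
print(len(posts), len(edges))
```

Because every new post links to exactly one parent, each generated cascade is a tree, so its star and chain structure can be compared against the shapes observed in the data.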
32
Cascades in Blogs
[Figure: model vs. data for the most frequent cascades; log(Count) vs. log(cascade size, # nodes), log(star size), and log(chain size)]
33
Cascades in Blogs
Limitations of SIS
▫ Closed world assumption
▫ Forced to set infection probability low to avoid large epidemics, which possibly limits stars
▫ No time limit, which possibly overestimates chains
35
Chain Letter Cascades
David Liben-Nowell, Jon Kleinberg. Tracing the Flow of Information on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Sciences, Vol. 105, No. 12 (March 2008), pp. 4633-4638.
Goal: How can we trace the path of a meme, and explain these paths?
36
Chain Letter Cascades
Data: NPR chain letter records.
▫ People were directed to sign and send back to the admin
▫ Had several copies of the lists, with overlaps
▫ Reconstructed the trees using edit distance
37
Chain Letter Cascades A reconstruction:
38
Chain Letter Cascades The tree:
39
Chain Letter Cascades
How to model?
▫ These trees have much longer paths
▫ Two considerations: spatial distance (geographic) and timing
40
Chain Letter Cascades
Model: based on a delay distribution.
Nodes reply-to-all, so latecomers just append.
41
Chain Letter Cascades
Validation: Simulated on a real social network (LiveJournal), which produced similar trees.
Limitations:
▫ The chain letter mechanism is a somewhat nontraditional form of diffusion
▫ Closed-world assumption is perhaps OK
43
Network-Based Marketing
Shawndra Hill, Foster Provost, Chris Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, Vol. 21, No. 2 (2006), pp. 256-275.
Question: Is there statistical evidence that network linkage directly affects product adoption?
44
Network-Based Marketing
Data: Direct-mail marketing campaign for adopting a new communications service.
▫ 21 target segments, millions of customers
▫ Divided based on: loyalty; previous adoptions; predictive scores based on other demographics; different marketing campaigns (postcards, calls)
45
Network-Based Marketing
46
Hypothesis: A customer who has had direct communication with a subscriber is more likely to adopt.
▫ Data: (incomplete) network information: ID of users, timestamp, duration
To test, added an “NN” (network neighbor) flag to the features if a customer had communicated with a subscriber (0.3% overall).
47
Network-Based Marketing
Created a baseline statistical model based on node attributes.
▫ “Loyalty”: how the consumer used services in the past
▫ Geographic: city, state, etc.
▫ Demographic: census-type data, credit score
Added a variable for NN and performed logistic regression on each segment, with the response variable being “take rate”.
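To make the quantities concrete, the sketch below computes take rates, a lift ratio, and a log-odds ratio for a single segment. The counts are hypothetical and chosen only to show the arithmetic, and the lift ratio is assumed here to mean the NN take rate divided by the non-NN take rate. When the NN flag is the only predictor in a logistic regression with an intercept, its fitted coefficient equals this log-odds ratio; with the other attributes included, as in the model described above, the coefficient is an adjusted version of it.

```python
import math

def nn_effect(nn_adopt, nn_total, other_adopt, other_total):
    """Take rates, lift ratio, and log-odds ratio for NN vs. non-NN customers."""
    take_nn = nn_adopt / nn_total
    take_other = other_adopt / other_total
    odds_nn = nn_adopt / (nn_total - nn_adopt)
    odds_other = other_adopt / (other_total - other_adopt)
    return {
        "take_rate_nn": take_nn,
        "take_rate_other": take_other,
        "lift": take_nn / take_other,
        "log_odds_ratio": math.log(odds_nn / odds_other),
    }

# Hypothetical counts for one segment, not taken from the study.
print(nn_effect(nn_adopt=30, nn_total=1000, other_adopt=100, other_total=10000))
```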
48
Network-Based Marketing Log-odds ratio for NN variable
49
Network-Based Marketing
[Figure: take rates and lift ratios]
50
Network-Based Marketing Added a “segment 22” consisting of only NN, but made up of less promising customers.
51
Network-Based Marketing
What about causality? What if the adoption is due to homophily?
To address this, sampled from the non-NN customers to build a data set similar to the NN group.
Performed logistic regression, which showed that network impact is highest for the least loyal group.
52
Network-Based Marketing Lift curve for NN
53
Network-Based Marketing
What about other network features?
▫ Degree, transactions, connectedness, etc.
Added network features to the existing regression model and tested lift.
54
Network-Based Marketing Lift in sales for both models
55
Conclusion
Several ways of approaching the study of diffusion.
No model is perfect. Considerations:
▫ Closed world assumption vs. external effects
▫ Homophily and node attributes
▫ Network structure
Network information is valuable, but (usually) does not account for everything.