Information Diffusion Mary McGlohon CMU 10-802 3/23/10.

1 Information Diffusion Mary McGlohon CMU 10-802 3/23/10

2 Outline Intro: Models for diffusion ▫Epidemiological: SIS/SIR/SIRS ▫Threshold models Case studies ▫SIR: Info diffusion in blogs ▫SIS: Cascades in blogs ▫Timing: Cascades in chain letters ▫A closer look: Network-based Marketing

3 Epidemiological: SIS Susceptible, Infected, Susceptible ▫Infected for t_I timesteps ▫While infected, transmits with probability β ▫After t_I steps, returns to susceptible

4 Epidemiological: SIR Susceptible, Infected, Removed ▫Infected for t_I timesteps ▫While infected, transmits with probability β ▫After t_I steps, goes to removed/recovered

5 Epidemiological: SIRS Susceptible, Infected, Removed, Susceptible ▫Combination of SIS+SIR ▫After t_I steps, goes to removed/recovered ▫After t_R steps, returns to susceptible
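
The three epidemiological models above differ only in what happens after the infectious period, so they can be sketched with one routine. This is a minimal illustration, not code from the lecture: the function name `simulate_epidemic`, the adjacency-dict graph format, and the choice that an infected node makes one independent transmission attempt per susceptible neighbor per timestep are all assumptions.

```python
import random

def simulate_epidemic(neighbors, seeds, beta, t_i, t_r, steps, rng):
    """Discrete-time SIS / SIR / SIRS on a graph (illustrative sketch).

    t_r = 0    -> SIS  (infected nodes return straight to susceptible)
    t_r = None -> SIR  (removed nodes stay removed)
    t_r > 0    -> SIRS (removed nodes become susceptible after t_r steps)
    Returns a list with the S/I/R counts after every step.
    """
    state = {v: "S" for v in neighbors}
    clock = {v: 0 for v in neighbors}   # steps left in the current I or R phase
    for s in seeds:
        state[s], clock[s] = "I", t_i

    history = []
    for _ in range(steps):
        new_state, new_clock = dict(state), dict(clock)
        for v in neighbors:
            if state[v] == "I":
                # Assumption: one transmission attempt per susceptible neighbor per step.
                for u in neighbors[v]:
                    if state[u] == "S" and new_state[u] == "S" and rng.random() < beta:
                        new_state[u], new_clock[u] = "I", t_i
                new_clock[v] = clock[v] - 1
                if new_clock[v] == 0:
                    if t_r == 0:
                        new_state[v] = "S"
                    else:
                        new_state[v] = "R"
                        new_clock[v] = t_r if t_r is not None else 0
            elif state[v] == "R" and t_r is not None:
                new_clock[v] = clock[v] - 1
                if new_clock[v] == 0:
                    new_state[v] = "S"
        state, clock = new_state, new_clock
        history.append({k: sum(1 for x in state.values() if x == k) for k in "SIR"})
    return history
```

For example, `simulate_epidemic({0: [1], 1: [0, 2], 2: [1]}, [0], beta=1.0, t_i=1, t_r=None, steps=3, rng=random.Random(0))` runs SIR along a three-node path; with certain transmission the infection sweeps the path and every node ends up removed.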

6 Epidemiological: Networks Historically, SIS/SIR assumed a person could infect anybody else (a full clique). There is an epidemic threshold in SIS. For random power-law networks, threshold = 0 [Pastor-Satorras and Vespignani] ▫(But not for power-law networks with high clustering coefficients [Eguíluz and Klemm])
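
As a reminder of where the zero threshold comes from, the usual mean-field statement can be written as below. The notation is assumed here rather than taken from the slides: λ is the effective spreading rate, ⟨k⟩ and ⟨k²⟩ are the first and second moments of the degree distribution, and γ is the power-law exponent.

```latex
\lambda_c^{\text{homogeneous}} = \frac{1}{\langle k \rangle},
\qquad
\lambda_c^{\text{heterogeneous}} = \frac{\langle k \rangle}{\langle k^2 \rangle},
\qquad
P(k) \propto k^{-\gamma},\ 2 < \gamma \le 3
\;\Rightarrow\;
\langle k^2 \rangle \to \infty
\;\Rightarrow\;
\lambda_c \to 0 .
```

In words: on an uncorrelated random network the threshold shrinks as the degree variance grows, and for power-law degree distributions with exponent at most 3 the second moment diverges with network size, which drives the threshold to zero.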

7 Threshold Models Each node in the network has a weighted threshold. If the (weighted) number of its adopted neighbors reaches that threshold, the node adopts.
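
A threshold model of this kind can be sketched in a few lines. The version below is an assumption-laden illustration (a linear-threshold variant with per-edge weights); the name `threshold_cascade` and the input format are hypothetical and not from the lecture.

```python
def threshold_cascade(in_weights, thresholds, seeds):
    """Run a threshold model until no further node adopts.

    in_weights[v] maps each neighbor u of v to the weight of u's influence on v.
    A node adopts once the summed weight of its adopted neighbors reaches
    thresholds[v]. Returns the final set of adopters.
    """
    adopted = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in in_weights.items():
            if v in adopted:
                continue
            influence = sum(w for u, w in nbrs.items() if u in adopted)
            if influence >= thresholds[v]:
                adopted.add(v)
                changed = True
    return adopted
```

Because adoption is never undone, the final set does not depend on the order in which nodes are visited; only the number of passes does.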

8 Outline Intro: Models for diffusion ▫Epidemiological: SIS/SIR/SIRS ▫Threshold models Case studies ▫SIR: Info diffusion in blogs ▫SIS: Cascades in blogs ▫Timing: Cascades in chain letters ▫A closer look: Network-based Marketing

9 Info Diffusion in Blogs D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins. Information Diffusion Through Blogspace. In WWW '04 (2004). Goal: How do topics trend in blogs, and how can we model diffusion of topics?

10 Info Diffusion in Blogs Data: Crawled 11K blogs, 400K posts. Found 340 topics: ▫apple arianna ashcroft astronaut blair boykin bustamante chibi china davis diana farfarello guantanamo harvard kazaa longhorn schwarzenegger udell siegfried wildfires zidane gizmodo microsoft saddam

11 Info Diffusion in Blogs Topics = Chatter + Spikes ▫Chatter: Alzheimer ▫Spike: Chibi ▫Spiky Chatter: Microsoft

12 Info Diffusion in Blogs Modeled as SIR ▫Some set of authors is infected to write about a topic ▫Then propagate, as others write new posts on that topic ▫Measure the topic over time and other properties Fit using EM ▫Compute probability of propagation along each edge

13 Info Diffusion in Blogs Validation: ▫Synthetic: used a modified Erdős-Rényi graph and created propagation on it; found that EM was able to identify transmission along most edges ▫Real: found “internet-only” topics; looked at the most highly ranked expected transmission links and identified a real link in 90% of cases

14 Info Diffusion in Blogs Limitations of SIR ▫No multiple postings ▫No “stickiness” (which topics resonate with whom) ▫No time-limiting factor in topics ▫“Closed world assumption”: no outside influences after initial infection

15 Outline Intro: Models for diffusion ▫Epidemiological: SIS/SIR/SIRS ▫Threshold models Case studies ▫SIR: Info diffusion in blogs ▫SIS: Cascades in blogs ▫Timing: Cascades in chain letters ▫A closer look: Network-based Marketing

16 Cascades in Blogs Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs: Patterns and a Model. In SIAM International Conference on Data Mining (SDM '07) (2007). Goal: What do cascades (conversation trees) in blogs look like, and how can we model them?

17 Cascades in Blogs Data: ▫Gathered from August-September 2005 ▫Used set of 44,362 blogs, 2.4 million posts ▫245,404 blog-to-blog links [Figure: number of posts per day over time, x-axis ticks Jul 4, Aug 1, Sep 29]

18 Cascades in Blogs ▫What is the timing of links? ▫What are cascade sizes? ▫What are cascade shapes? [Figure: blogs B1-B4 in the blogosphere and the cascades formed by posts a-e, including a “star” and a “chain”]

19 Cascades in Blogs What is the timing of links? Does popularity decay at a constant rate? With an exponential (“half life”)? [Figure panels: linear-linear scale, log-linear scale, log-log scale]

20 Cascades in Blogs Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$. [Figure: log(# in-links) vs. log(days after post), slope = -1.5; panel label: linear-linear scale]

21 Cascades in Blogs How are cascade sizes distributed? Geometric distribution? [Figure panels: linear-linear scale, log-linear scale, log-log scale]

22 Cascades in Blogs Q: What size distribution do cascades follow? Are large cascades frequent? Observation: The probability of observing a cascade of n blog posts follows a Zipf distribution: $p(n) \propto n^{-2}$. [Figure: log(count) vs. log(cascade size in # of nodes), slope = -2]
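
Slopes such as -1.5 and -2 are read off a straight line on a log-log plot. A quick way to reproduce that reading is an ordinary least-squares fit in log space, sketched below. The helper name `loglog_slope` is hypothetical and the data in the usage note is synthetic; a least-squares fit is only a rough check, and maximum-likelihood estimators are generally preferred for real power-law data.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For data following y = c * x**a the returned slope is a.
    All xs and ys must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var
```

Feeding it exact synthetic counts, for instance `loglog_slope(range(1, 51), [1000 * n ** -2 for n in range(1, 51)])`, recovers the exponent of the generating law up to floating-point error.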

23 Cascades in Blogs How are cascade shapes distributed? More stars? More chains?

24 Cascades in Blogs Q: What is the distribution of particular cascade shapes? Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5). [Figure: log(count) vs. log(size in # nodes) for chains (a = -8.5) and for stars (a = -3.1)]

25-31 Cascades in Blogs Based on SIS model in epidemiology ▫Randomly pick a blog to infect, add its post to the cascade ▫Infect each in-linked neighbor with probability β ▫Add infected neighbors’ posts to the cascade ▫Set the old infected node to uninfected [Figure, built up step by step over slides 25-31: blogs B1-B4, with posts p_{1,1}, p_{2,1}, and p_{4,1} added as the infection spreads]
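
The four steps of the SIS-based cascade model can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function name `generate_cascade`, the `start` and `max_posts` parameters, and the dict-of-lists format for in-links are assumptions added for the example.

```python
import random

def generate_cascade(in_links, beta, rng, start=None, max_posts=10_000):
    """Generate one cascade with an SIS-style process (illustrative sketch).

    in_links[b] lists the blogs that link to blog b (its in-linked neighbors).
    Returns (posts, edges): posts[i] is the blog that wrote post i, and each
    edge (child, parent) records that post `child` was triggered by `parent`.
    """
    if start is None:
        start = rng.choice(sorted(in_links))    # randomly pick a blog to infect
    posts = [start]                              # the first post of the cascade
    edges = []
    frontier = [0]                               # posts by currently infected blogs
    # max_posts is a soft cap (checked once per round) so dense graphs terminate.
    while frontier and len(posts) < max_posts:
        next_frontier = []
        for post_id in frontier:
            for neighbor in in_links[posts[post_id]]:
                if rng.random() < beta:          # infect with probability beta
                    posts.append(neighbor)       # the neighbor's post joins the cascade
                    edges.append((len(posts) - 1, post_id))
                    next_frontier.append(len(posts) - 1)
        # Old infected blogs become susceptible again (SIS), so a blog may
        # contribute more than one post to the same cascade.
        frontier = next_frontier
    return posts, edges
```

Running it many times with a small `beta` and tallying cascade sizes and shapes is one way to compare the model against the observed distributions.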

32 Cascades in Blogs [Figure: model vs. data for the most frequent cascades, and for log(count) vs. log(cascade size in # nodes), log(star size), and log(chain size)]

33 Cascades in Blogs Limitations of SIS ▫Closed world assumption ▫Forced to set infection probability low to avoid large epidemics, which possibly limits stars ▫No time limit, which possibly overestimates chains

34 Outline Intro: Models for diffusion ▫Epidemiological: SIS/SIR/SIRS ▫Threshold models Case studies ▫SIR: Info diffusion in blogs ▫SIS: Cascades in blogs ▫Timing: Cascades in chain letters ▫A closer look: Network-based Marketing

35 Chain Letter Cascades David Liben-Nowell, Jon Kleinberg. Tracing the Flow of Information on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Sciences, Vol. 105, No. 12. (March 2008), pp. 4633-4638. Goal: How can we trace the path of a meme, and explain these paths?

36 Chain Letter Cascades Data: NPR chain letter records. ▫People directed to sign and send back to admin ▫Had several copies of lists, overlaps ▫Reconstructed the trees using edit distance

37 Chain Letter Cascades A reconstruction:

38 Chain Letter Cascades The tree:

39 Chain Letter Cascades How to model? ▫These trees have much longer paths ▫Two considerations: spatial distance (geographic) and timing

40 Chain Letter Cascades Model: based on a delay distribution Nodes reply-to-all, so latecomers just append.
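
One way to see why delays plus "latecomers just append" yield long, narrow trees is a small event-driven simulation. The sketch below is a simplification made for illustration and is not the model from the paper: it assumes instant delivery, exponentially distributed delays, and that a node signs the longest list it has received by the time it acts. The name `chain_letter_tree` is hypothetical.

```python
import heapq
import random

def chain_letter_tree(neighbors, root, rng, mean_delay=1.0):
    """Return {node: parent} for one simulated chain letter (sketch).

    Each node acts once, a random delay after first receiving the letter.
    When it acts, it signs the longest list it has seen so far and forwards
    the signed list to all of its neighbors.
    """
    best = {root: []}          # longest signature list each node has received
    parent = {}
    acted = set()
    events = [(0.0, root)]     # (time at which the node acts, node)
    while events:
        t, node = heapq.heappop(events)
        acted.add(node)
        signed = best[node] + [node]
        parent[node] = best[node][-1] if best[node] else None
        for nb in neighbors[node]:
            if nb in acted:
                continue
            if nb not in best:
                # First receipt: schedule the neighbor to act after a delay.
                best[nb] = signed
                delay = rng.expovariate(1.0 / mean_delay)
                heapq.heappush(events, (t + delay, nb))
            elif len(signed) > len(best[nb]):
                # Latecomer: it will append to the longer list instead.
                best[nb] = signed
    return parent
```

On a fully connected group this produces a single chain rather than a star, because everyone who has not yet signed switches to the newest, longest list before acting.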

41 Chain Letter Cascades Validation: Simulated on a real social network (LiveJournal), produced similar trees. Limitations: ▫The chain letter mechanism is a somewhat nontraditional form of diffusion ▫Closed-world assumption is perhaps OK

42 Outline Intro: Models for diffusion ▫Epidemiological: SIS/SIR/SIRS ▫Threshold models Case studies ▫SIR: Info diffusion in blogs ▫SIS: Cascades in blogs ▫Timing: Cascades in chain letters ▫A closer look: Network-Based Marketing

43 Network-Based Marketing Shawndra Hill, Foster Provost, Chris Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, Vol. 21, No. 2. (2006), pp. 256-275. Question: Is there statistical evidence that network linkage directly affects product adoption?

44 Network-Based Marketing Data: Direct-mail marketing campaign for adopting a new communications service. ▫21 target segments, millions of customers ▫Divided based on: loyalty; previous adoptions; predictive scores based on other demographics; different marketing campaigns (postcards, calls)

45 Network-Based Marketing

46 Hypothesis: A customer who has had direct communication with a subscriber is more likely to adopt. ▫Data: (incomplete) network information: ID of users, timestamp, duration. To test, added an “NN” (network neighbor) flag to the features if a customer had communicated with a subscriber (0.3% of customers overall).

47 Network-Based Marketing Created baseline statistical model based on node attributes. ▫“Loyalty”: how consumer used services in past ▫Geographic: city, state, etc. ▫Demographic: census-type data, credit score. Added a variable for NN and performed logistic regression on each segment, with the response variable being “take rate”.
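
To make the NN effect concrete: in a logistic regression whose only predictor is a binary NN flag, the fitted NN coefficient equals the log-odds ratio of the 2x2 table of adoption by NN status. The sketch below computes that quantity together with take rates and lift. It is only an illustration; the study's models also include the loyalty, geographic, and demographic attributes, and the function name `nn_effect` and the counts in the usage note are made up.

```python
import math

def nn_effect(nn_adopters, nn_total, other_adopters, other_total):
    """Take rates, lift ratio, and log-odds ratio for a binary NN flag.

    Returns (take_rate_nn, take_rate_other, lift, log_odds_ratio).
    """
    take_nn = nn_adopters / nn_total
    take_other = other_adopters / other_total
    lift = take_nn / take_other
    odds_nn = take_nn / (1.0 - take_nn)
    odds_other = take_other / (1.0 - take_other)
    return take_nn, take_other, lift, math.log(odds_nn / odds_other)
```

With hypothetical counts such as 20 adopters among 1,000 NN customers and 50 among 10,000 others, `nn_effect(20, 1000, 50, 10000)` gives take rates of 2% and 0.5%, a lift of about 4, and a positive log-odds ratio.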

48 Network-Based Marketing Log-odds ratio for NN variable

49 Network-Based Marketing [Figure panels: take rates, lift ratios]

50 Network-Based Marketing Added a “segment 22” consisting of only NN, but made up of less promising customers.

51 Network-Based Marketing What about causality? What if the adoption is due to homophily? To address this, sampled from the non-NN customers to build a data set comparable to the NN group. Performed logistic regression, which showed that network impact is highest for the least loyal group.

52 Network-Based Marketing Lift curve for NN

53 Network-Based Marketing What about other network features? ▫Degree, transactions, connectedness, etc. Added network features to existing regression model, tested lift.

54 Network-Based Marketing Lift in sales for both models

55 Conclusion There are several ways of approaching the study of diffusion, and no model is perfect. Considerations: ▫Closed world assumption vs. external effects ▫Homophily and node attributes ▫Network structure. Network information is valuable, but (usually) does not account for everything.

