1 Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1, 2), Jure Leskovec (1), Andreas Krause (3)
(1) Stanford University, (2) MPI for Biological Cybernetics, (3) California Institute of Technology
2 Networks and Processes
It is often hard to directly observe a social or information network:
Hidden or hard-to-reach populations: injection drug users
Implicit connections: the network of information sharing in online media
It is often easier to observe the results of the processes taking place on such (invisible) networks:
Virus propagation: people get sick and see a doctor
Information networks: blogs mention information
3 Information Diffusion Network
Information diffuses through the network.
We only see who mentions the information, not where they got it from.
Can we reconstruct the (hidden) diffusion network?
4 More Examples
Virus propagation. Process: viruses propagate through the network. We observe: when people get sick. It's hidden: who infected them.
Word of mouth and viral marketing. Process: recommendations and influence propagate. We observe: when people buy products. It's hidden: who influenced them.
Can we infer the underlying social or information network?
5 Inferring the Network
There is a hidden directed network.
We only see the times when nodes get infected:
Contagion $c_1$: (a, 1), (c, 2), (b, 3), (e, 4)
Contagion $c_2$: (c, 1), (a, 4), (b, 5), (d, 6)
We want to infer the who-infects-whom network.
[Figure: hidden directed network on nodes a, b, c, d, e, and the two cascades spreading over it]
6 Our Problem Formulation
Plan for the talk:
1. Define a continuous-time model of diffusion
2. Define the likelihood of the observed propagation data given a graph
3. Show how to efficiently compute the likelihood
4. Show how to efficiently optimize the likelihood
Goal: find a graph G that maximizes the likelihood.
There is a super-exponential number of graphs G.
Our method finds a near-optimal solution in O(N^2)!
7 Cascade Generation Model
The cascade reaches node u at time $t_u$ and spreads to u's neighbors v:
with probability β the cascade propagates along the edge (u, v), and $t_v = t_u + \Delta$, where the delay Δ is drawn from the distribution f.
We assume each node v has only one parent!
[Figure: example cascade over nodes a through f, with infection times $t_a, t_b, t_c, t_e, t_f$ and delays $\Delta_1, \ldots, \Delta_4$]
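The generation model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_cascade`, the adjacency-list graph, and the exponential delay distribution standing in for f are all hypothetical choices, not taken from the talk. A node reached by several neighbors keeps the earliest arrival, which gives it a single parent.

```python
import heapq
import itertools
import random


def simulate_cascade(graph, source, beta, sample_delay, rng):
    """Simulate one cascade; returns (infection_times, parents).

    graph: dict mapping each node to a list of its out-neighbors.
    beta: probability that the cascade propagates along an edge.
    sample_delay: callable drawing one delay from f (assumed form).
    """
    times, parent = {}, {}
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    events = [(0.0, next(tie), source, None)]
    while events:
        t, _, v, u = heapq.heappop(events)
        if v in times:
            continue  # v was reached earlier: it keeps its single parent
        times[v], parent[v] = t, u
        for w in graph.get(v, []):
            # each edge transmits independently with probability beta
            if w not in times and rng.random() < beta:
                heapq.heappush(events, (t + sample_delay(rng), next(tie), w, v))
    return times, parent


# Hypothetical example graph and parameters.
rng = random.Random(0)
graph = {"a": ["b", "c"], "b": ["e"], "c": ["e", "f"]}
times, parent = simulate_cascade(
    graph, "a", beta=0.5, sample_delay=lambda r: r.expovariate(1.0), rng=rng
)
for node in sorted(times, key=times.get):
    print(node, round(times[node], 3), parent[node])
```

Only `times` would be visible to an observer; `parent` is exactly the hidden information the talk sets out to recover.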
8 Likelihood of a Cascade
If u infected v in a cascade c, the transmission probability is
$P_c(u, v) \propto f(t_v - t_u)$, with $t_v > t_u$ and (u, v) neighbors.
Probability that cascade c propagates in a tree T:
$P(c \mid T) = \prod_{(u, v) \in T} P_c(u, v)$
To model that, in reality, any node j in a cascade may have been infected by an external influence m: $P_c(m, j) = \varepsilon$
Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)
[Figure: network on nodes a, b, c, d, e with the tree T highlighted, and an external node m connected to every node with weight ε]
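As a worked illustration of the tree likelihood, the sketch below sums log f(t_v - t_u) over the edges of a tree. Several things here are assumptions rather than content from the slide: the exponential density with rate `alpha`, the function names, and the chain a → b → c → e used as the tree (the actual tree was in a figure that did not survive). Normalizing constants and the external-influence term ε are ignored.

```python
import math


def exp_density(delta, alpha=1.0):
    """Assumed delay density f: exponential with rate alpha (0 for delta <= 0)."""
    return alpha * math.exp(-alpha * delta) if delta > 0 else 0.0


def tree_log_likelihood(times, tree_edges, density=exp_density):
    """Sum of log f(t_v - t_u) over the edges (u, v) of a propagation tree."""
    total = 0.0
    for u, v in tree_edges:
        p = density(times[v] - times[u])
        if p == 0.0:
            return float("-inf")  # edge does not go forward in time
        total += math.log(p)
    return total


# Infection times from the slide; the chain tree is a hypothetical choice.
times = {"a": 1, "b": 2, "c": 4, "e": 8}
chain = [("a", "b"), ("b", "c"), ("c", "e")]
print(tree_log_likelihood(times, chain))
```

Working in log space turns the product over edges into a sum, which is also the form used later when only the most likely tree is kept.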
9 Finding the Diffusion Network
There are many possible propagation trees for a cascade, for example c: (a, 1), (c, 2), (b, 3), (e, 4).
We need to consider all possible propagation trees T supported by G:
$P(c \mid G) = \sum_{T \in \mathcal{T}_c(G)} P(c \mid T)$
Likelihood of a set of cascades C on G:
$P(C \mid G) = \prod_{c \in C} P(c \mid G)$
We want to find:
$\hat{G} = \arg\max_{|G| \le k} P(C \mid G)$
Good news: computing $P(c \mid G)$ is tractable. For each c there are O(n^n) possible transmission trees of G, but the Matrix Tree Theorem can compute this sum in O(n^3)!
Bad news: we actually want to find the best graph G, and there is a super-exponential number of graphs!
[Figure: three different propagation trees for the same cascade on the network over nodes a, b, c, d, e]
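To see why the sum over exponentially many trees is tractable, consider the special case where every edge used by the cascade goes forward in time. Then every choice of one earlier parent per node is a valid tree, and the sum over trees factors into a product of per-node sums. The sketch below checks this against brute-force enumeration. The edge weights, the function names, and the omission of the external source m are assumptions made for illustration; the slide itself only cites the Matrix Tree Theorem, and this shortcut is a simplification, not the talk's stated procedure.

```python
import itertools
import math


def earlier_parents(order, weight):
    """For each non-root node, list candidate parents infected earlier."""
    return {
        v: [u for u in order[:i] if (u, v) in weight]
        for i, v in enumerate(order)
        if i > 0
    }


def sum_over_trees_bruteforce(order, weight):
    """Enumerate every parent assignment and add up the tree products."""
    cands = earlier_parents(order, weight)
    nodes = list(cands)
    total = 0.0
    for choice in itertools.product(*(cands[v] for v in nodes)):
        total += math.prod(weight[(u, v)] for u, v in zip(choice, nodes))
    return total


def sum_over_trees_closed_form(order, weight):
    """Same sum, computed as a product over nodes of summed in-edge weights."""
    cands = earlier_parents(order, weight)
    return math.prod(sum(weight[(u, v)] for u in cands[v]) for v in cands)


# Node order from the cascade on the slide; the weights are hypothetical.
order = ["a", "c", "b", "e"]
weight = {("a", "c"): 0.5, ("a", "b"): 0.2, ("c", "b"): 0.3,
          ("b", "e"): 0.4, ("c", "e"): 0.1, ("e", "a"): 0.9}
print(sum_over_trees_bruteforce(order, weight))
print(sum_over_trees_closed_form(order, weight))
```

The edge (e, a) points backwards in time, so neither function uses it.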
10 An Alternative Formulation
We consider only the most likely tree.
Maximum log-likelihood for a cascade c under a graph G:
$F_c(G) = \max_{T \in \mathcal{T}_c(G)} \log P(c \mid T)$
Log-likelihood of G given a set of cascades C:
$F_C(G) = \sum_{c \in C} F_c(G)$
The problem is NP-hard (by reduction from MAX-k-COVER).
Our algorithm can solve it near-optimally in O(N^2).
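11 Max Directed Spanning Tree
Given a cascade c, what is the most likely propagation tree?
It is a maximum directed spanning tree (MDST): the tree that maximizes the sum of its edge weights, where each edge (u, v) is weighted by its log transmission probability.
We just need to compute the MDST of the subgraph of G induced by c.
The subgraph of G induced by c has no loops (it is a DAG), because edges only go forward in time.
So for each node, we just pick an in-edge of maximum weight.
Local greedy selection gives the optimal tree!
[Figure: example DAG on nodes a, b, c, d with the maximum-weight in-edge of each node highlighted]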
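The local greedy selection can be sketched as follows. The function name `most_likely_tree`, the dictionary-based inputs, and the example weights (standing in for log transmission probabilities) are hypothetical. The sketch relies on the DAG property from the slide: only edges that go forward in time are considered, so picking the best in-edge per node can never create a cycle.

```python
def most_likely_tree(times, weight):
    """For each infected node, pick the max-weight in-edge from an earlier node.

    times: dict node -> infection time (nodes in the cascade only).
    weight: dict (u, v) -> edge weight, e.g. a log transmission probability.
    Returns dict node -> chosen parent (None if no earlier in-neighbor).
    """
    parent = {v: None for v in times}
    best = {v: float("-inf") for v in times}
    for (u, v), w in weight.items():
        forward = u in times and v in times and times[u] < times[v]
        if forward and w > best[v]:
            parent[v], best[v] = u, w
    return parent


# Cascade times from the talk's running example; weights are hypothetical.
times = {"a": 1, "c": 2, "b": 3, "e": 4}
weight = {("a", "c"): -1.0, ("a", "b"): -2.0, ("c", "b"): -0.5,
          ("b", "e"): -0.7, ("c", "e"): -1.5, ("e", "a"): -0.1}
print(most_likely_tree(times, weight))
```

A single pass over the edges is enough, which is what makes evaluating the objective for one cascade cheap.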
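12 Great News: Submodularity
Theorem: The log-likelihood $F_c(G)$ is monotone and submodular in the edges of the graph G. For edge sets $A \subseteq B \subseteq V \times V$ and an edge $e \notin B$:
$F_c(A \cup \{e\}) - F_c(A) \ge F_c(B \cup \{e\}) - F_c(B)$
The left side is the gain of adding an edge to a "small" graph; the right side is the gain of adding it to a "large" graph.
Proof: Consider a single cascade c and an edge e into node s with weight x.
Let w be the maximum weight of an in-edge of s in A.
Let w' be the maximum weight of an in-edge of s in B.
Since $A \subseteq B$, we know $w \le w'$. Now:
$F_c(A \cup \{e\}) - F_c(A) = \max(w, x) - w \ge \max(w', x) - w' = F_c(B \cup \{e\}) - F_c(B)$
Then the log-likelihood $F_C(G)$, a sum of the $F_c(G)$ over cascades, is monotone and submodular too.
[Figure: nested edge sets A and B inside V×V; node s with in-edges of weight w (in A), w' (in B), and x (the new edge e)]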
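The diminishing-returns inequality can be checked numerically on a small example. The sketch below evaluates a per-cascade objective of the form "sum over nodes of the best in-edge weight" and tests every pair A ⊆ B exhaustively. The baseline value `base` (a stand-in for the external-influence edge that every node always has), the weights, and the function names are assumptions made for illustration.

```python
import itertools


def cascade_objective(edges, weight, nodes, base=0.0):
    """F_c(A): sum over nodes of the best in-edge weight in A, or `base`."""
    best = {v: base for v in nodes}
    for u, v in edges:
        best[v] = max(best[v], weight[(u, v)])
    return sum(best.values())


def is_submodular(weight, nodes):
    """Exhaustively check F(A+e) - F(A) >= F(B+e) - F(B) for all A <= B."""
    all_edges = list(weight)
    subsets = [set(s) for r in range(len(all_edges) + 1)
               for s in itertools.combinations(all_edges, r)]

    def gain(s, e):
        return (cascade_objective(s | {e}, weight, nodes)
                - cascade_objective(s, weight, nodes))

    for a in subsets:
        for b in subsets:
            if not a <= b:
                continue
            for e in all_edges:
                if e not in b and gain(a, e) < gain(b, e) - 1e-12:
                    return False
    return True


# Hypothetical edge weights on a four-node example.
nodes = ["a", "b", "c", "d"]
weight = {("a", "b"): 5.0, ("c", "b"): 3.0, ("a", "c"): 4.0, ("b", "d"): 1.0}
print(is_submodular(weight, nodes))
```

The check is exponential in the number of edges, so it is only meant as a sanity test of the inequality, not as part of any inference procedure.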
13 Finding the Diffusion Graph
Use greedy hill-climbing to maximize $F_C(G)$:
at every step, pick the edge that maximizes the marginal improvement.
Benefits:
1. Approximation guarantee (1 - 1/e ≈ 0.63 of OPT)
2. Tight on-line bounds on the solution quality
3. Speed-ups: lazy evaluation (by submodularity) and localized update (by the structure of the problem)
[Figure: network on nodes a, b, c, d, e with a table of marginal gains for each candidate edge; after an edge is picked, only the gains of nearby edges are recomputed (localized update)]
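A minimal sketch of greedy hill-climbing with lazy evaluation follows. It is a generic lazy-greedy routine for a monotone submodular set function, not the authors' implementation: the function name `lazy_greedy`, the toy objective, and the weights are hypothetical, and the localized-update speed-up mentioned on the slide is not shown. Stale marginal gains are kept in a heap as upper bounds, and an edge is accepted only once its gain has been recomputed for the current solution.

```python
import heapq


def lazy_greedy(candidates, objective, k):
    """Select up to k elements, re-evaluating marginal gains lazily."""
    selected = []
    current = objective(selected)
    # Heap entries: (-gain, tie-breaker, element, size of solution when computed).
    heap = [(-(objective([e]) - current), i, e, 0)
            for i, e in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, i, e, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date; by submodularity no stale entry can beat it.
            if -neg_gain <= 0:
                break  # no remaining edge improves the objective
            selected.append(e)
            current += -neg_gain
        else:
            # Stale upper bound: recompute against the current solution.
            gain = objective(selected + [e]) - current
            heapq.heappush(heap, (-gain, i, e, len(selected)))
    return selected


# Toy objective (hypothetical): each node contributes its best in-edge weight.
weight = {("a", "b"): 5, ("c", "b"): 3, ("a", "c"): 4, ("b", "d"): 1}


def objective(edges):
    best = {}
    for u, v in edges:
        best[v] = max(best.get(v, 0), weight[(u, v)])
    return sum(best.values())


print(lazy_greedy(list(weight), objective, k=3))
```

In this example the edge (c, b) is never selected: once (a, b) is in the solution its marginal gain drops to zero, and the lazy check discovers that with a single re-evaluation.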
14 Experimental Setup
We validate our method on:
Synthetic data: generate a graph G on k edges, generate cascades, record node infection times, reconstruct G.
Real data: MemeTracker, 172 million news articles (Aug '08 to Sept '09), 343 million textual phrases (quotes).
Questions:
How many edges of G can we find? (Precision-recall, break-even point)
How many cascades do we need?
How fast are we?
How well do we optimize $F_C(G)$?
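The talk does not spell out how the break-even point is computed, so the sketch below uses the standard definition as an assumption: the point on the precision-recall curve where precision equals recall. When the inferred edges are ranked and exactly as many edges are kept as the true graph contains, both quantities reduce to the same fraction. Function names and the example edges are hypothetical.

```python
def precision_recall(inferred, true_edges):
    """Precision and recall of an inferred edge set against the true edges."""
    inferred, true_edges = set(inferred), set(true_edges)
    hits = len(inferred & true_edges)
    precision = hits / len(inferred) if inferred else 1.0
    recall = hits / len(true_edges) if true_edges else 1.0
    return precision, recall


def break_even_point(ranked_edges, true_edges):
    """Precision (= recall) when keeping as many edges as the true graph has.

    Assumes ranked_edges lists at least len(true_edges) distinct edges,
    ordered from most to least confident.
    """
    k = len(set(true_edges))
    return precision_recall(ranked_edges[:k], true_edges)[0]


# Hypothetical ground truth and inferred ranking.
true_edges = [("a", "b"), ("b", "c"), ("c", "d")]
ranked = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(precision_recall(ranked, true_edges))
print(break_even_point(ranked, true_edges))
```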
15 Small Synthetic Example
[Figure: small synthetic network in three panels: true network, baseline network, our method; annotation: pick the k strongest edges]
16 Synthetic Networks
Our performance does not depend on the network structure:
Synthetic networks: Forest Fire, Kronecker, etc.
Probability of transmission: exponential, power law
Break-even points of > 90% when the baseline gets %!
[Figure, two panels: 1024-node hierarchical Kronecker network, exponential transmission model; 1000-node Forest Fire network (α = 1.1), power-law transmission model]
17 How good is our graph?
We achieve ≈ 90% of the best possible network!
18 How many cascades do we need? With 2x as many infections as edges, the break-even point is already !
19 Running Time
Lazy evaluation and localized updates give a speed-up of two orders of magnitude!
20 Real Data
MemeTracker dataset: 172 million news articles (Aug '08 to Sept '09), 343 million textual phrases (quotes).
We have the times $t_c(w)$ when site w mentions phrase (quote) c.
Given the times when sites mention phrases, we infer the network of information diffusion: who tends to copy (repeat after) whom.
21 Real Network
We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground-truth network G.
From the MemeTracker dataset, we have the timestamps of:
1. Cascades of hyperlinks: sites link to other sites
2. Cascades of (MemeTracker) quotes: sites copy quotes from other sites
Can we infer the hyperlink network from cascades of hyperlinks? From cascades of MemeTracker quotes? Are they correlated?
[Figure: two example cascades over sites a, c, e, f]
22 Real Network
Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!
[Figure, two panels: 500-node hyperlink network using hyperlink cascades; 500-node hyperlink network using MemeTracker cascades]
23 Diffusion Network
[Figure: diffusion network of 5,000 news sites; legend: blogs, mainstream media]
24 Diffusion Network (small part)
[Figure: small part of the diffusion network; legend: blogs, mainstream media]
25 Networks and Processes
We infer hidden networks based on diffusion data (timestamps).
The problem is formulated in a maximum likelihood framework and is NP-hard to solve exactly.
We develop an approximation algorithm that:
is efficient: it runs in O(N^2)
is invariant to the structure of the underlying network
gives a near-optimal network with a tight bound on solution quality
Future work:
Learn both the network and the diffusion model
Extensions to other processes taking place on networks
Applications to other domains: biology, neuroscience, etc.
26 Thanks! For more (Code & Data):