Inferring Networks of Diffusion and Influence. Manuel Gomez Rodriguez, Jure Leskovec, Andreas Krause (Stanford University, MPI for Biological Cybernetics, California Institute of Technology)

Presentation transcript:

1 Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1, 2), Jure Leskovec (1), Andreas Krause (3)
(1) Stanford University, (2) MPI for Biological Cybernetics, (3) California Institute of Technology

2 Networks and Processes
• It is often hard to observe a social or information network directly:
  • Hidden or hard-to-reach populations: injection drug users
  • Implicit connections: the network of information sharing in online media
• It is often easier to observe the results of the processes taking place on such (invisible) networks:
  • Virus propagation: people get sick and see the doctor
  • Information networks: blogs mention information

3 Information Diffusion Network
• Information diffuses through the network
• We only see who mentions the information, not where they got it from
• Can we reconstruct the (hidden) diffusion network?

4 More Examples
• Virus propagation:
  • Process: viruses propagate through the network
  • We observe: when people get sick
  • It's hidden: who infected them
• Word of mouth & viral marketing:
  • Process: recommendations and influence propagate
  • We observe: when people buy products
  • It's hidden: who influenced them
• Can we infer the underlying social or information network?

5 Inferring the Network
• There is a hidden directed network
• We only see the times when nodes get infected:
  • Contagion c_1: (a, 1), (c, 2), (b, 3), (e, 4)
  • Contagion c_2: (c, 1), (a, 4), (b, 5), (d, 6)
• We want to infer the who-infects-whom network
[Figure: network diagrams over nodes a, b, c, d, e]
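As a minimal sketch of the input data on this slide, the two contagions can be stored as lists of (node, infection time) pairs. The helper `candidate_edges` is a hypothetical name, not something from the talk; it only encodes the observation that an edge (u, v) can explain an infection only if u was infected before v.

```python
from itertools import combinations

# Cascades from the slide: contagion id -> list of (node, infection time).
cascades = {
    "c1": [("a", 1), ("c", 2), ("b", 3), ("e", 4)],
    "c2": [("c", 1), ("a", 4), ("b", 5), ("d", 6)],
}

def candidate_edges(cascade):
    """Return every directed pair (u, v) with t_u < t_v.

    These are the only edges that could have carried the contagion,
    because an infection can only travel forward in time.
    """
    ordered = sorted(cascade, key=lambda node_time: node_time[1])
    return {
        (u, v)
        for (u, t_u), (v, t_v) in combinations(ordered, 2)
        if t_u < t_v
    }

for name, cascade in cascades.items():
    print(name, sorted(candidate_edges(cascade)))
```

Every cascade of n distinct infection times yields n(n-1)/2 candidate edges, which is why the true who-infects-whom structure cannot be read off a single cascade.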

6 Our Problem Formulation
• Plan for the talk:
  1. Define a continuous-time model of diffusion
  2. Define the likelihood of the observed propagation data given a graph
  3. Show how to efficiently compute the likelihood
  4. Show how to efficiently optimize the likelihood
• Find a graph G that maximizes the likelihood
• There is a super-exponential number of graphs G
• Our method finds a near-optimal solution in O(N^2)!

7 Cascade Generation Model
• Cascade generation model:
  • The cascade reaches u at time t_u and spreads to u's neighbors v:
  • with probability β the cascade propagates along (u, v), and t_v = t_u + Δ, with Δ ~ f(·)
• We assume each node v has only one parent!
[Figure: example cascade over nodes a to f, with infection times t_a, t_b, t_c, t_e, t_f and delays Δ_1 to Δ_4]
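The generation model can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' code: the function name `generate_cascade`, the adjacency-dict graph format, and the choice of an exponential delay distribution for f are all assumptions made here. Each node keeps only its earliest infection, which is one way to realize the "only one parent" assumption.

```python
import heapq
import itertools
import random

def generate_cascade(graph, source, beta, rate, rng, t0=0.0):
    """Simulate one cascade.

    graph:  dict mapping node -> iterable of out-neighbors (assumed format)
    beta:   probability that the cascade propagates along an edge
    rate:   rate of the exponential delay (assumed choice for f)
    Returns (infection_times, parent), both dicts keyed by node.
    """
    times, parent = {}, {}
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    queue = [(t0, next(tie), source, None)]
    while queue:
        t, _, v, u = heapq.heappop(queue)
        if v in times:
            continue  # v was already infected earlier: keep its single parent
        times[v], parent[v] = t, u
        for w in graph.get(v, ()):
            if w not in times and rng.random() < beta:
                delay = rng.expovariate(rate)  # Delta ~ f
                heapq.heappush(queue, (t + delay, next(tie), w, v))
    return times, parent

toy_graph = {"a": ["b", "c"], "b": ["c", "e"], "c": ["e"], "e": []}
times, parent = generate_cascade(toy_graph, "a", beta=0.5, rate=1.0,
                                 rng=random.Random(0))
print(times, parent)
```

In the inference setting only `times` is observed; `parent` is exactly the hidden information the method tries to recover.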

8 Likelihood of a Cascade
• If u infected v in a cascade c, its transmission probability is P_c(u, v) ∝ f(t_v − t_u), with t_v > t_u and (u, v) neighbors
• Prob. that cascade c propagates in a tree T: [equation not captured in the transcript]
• To model that in reality any node v in a cascade can have been infected by an external influence m: P_c(m, j) = ε
[Figure: tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8), with ε-edges from the external node m]
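A minimal sketch of how a tree likelihood could be evaluated, assuming the tree probability is the product of the transmission probabilities of its edges and assuming an exponential form f(Δ) = exp(−Δ/α) up to a normalizing constant. Both assumptions, the parameter `alpha`, and the two example trees are illustrative and not taken from the slide; external ε-edges are left out for brevity.

```python
def log_transmission(t_u, t_v, alpha=1.0):
    """log P_c(u, v) up to a constant, assuming f(d) = exp(-d / alpha)."""
    if t_v <= t_u:
        return float("-inf")  # infections cannot travel backward in time
    return -(t_v - t_u) / alpha

def tree_log_likelihood(times, tree_edges, alpha=1.0):
    """Sum of edge log-probabilities, i.e. the log of their product."""
    return sum(log_transmission(times[u], times[v], alpha)
               for u, v in tree_edges)

# Cascade from the slide: (a, 1), (b, 2), (c, 4), (e, 8).
times = {"a": 1, "b": 2, "c": 4, "e": 8}
chain = [("a", "b"), ("b", "c"), ("c", "e")]  # hypothetical tree
star = [("a", "b"), ("a", "c"), ("a", "e")]   # another hypothetical tree

print(tree_log_likelihood(times, chain))
print(tree_log_likelihood(times, star))
```

Under this assumed f, trees that connect each node to a recently infected parent score higher, because short delays are more probable than long ones.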

9 Finding the Diffusion Network
• There are many possible propagation trees:
  • c: (a, 1), (c, 2), (b, 3), (e, 4)
• Need to consider all possible propagation trees T supported by G
[Figure: three alternative propagation trees for c over nodes a, b, c, d, e]
• Likelihood of a set of cascades C on G: [equation not captured in the transcript]
• Want to find: [equation not captured in the transcript]
• Good news: computing P(c|G) is tractable
  • For each c, consider all O(n^n) possible transmission trees of G
  • The Matrix Tree Theorem can compute this sum in O(n^3)!
• Bad news: we actually want to find the graph itself [equation not captured in the transcript]
  • We have a super-exponential number of graphs!
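The equations on this slide are images that are missing from the transcript. A plausible reconstruction from the slide's wording is given below; the symbol \mathcal{T}_c(G) for the set of propagation trees of c supported by G and the edge budget k are notational assumptions, and the original slide may differ in detail.

```latex
\begin{align}
P(c \mid G) &= \sum_{T \in \mathcal{T}_c(G)} P(c \mid T)
  && \text{sum over all propagation trees supported by } G \\
P(C \mid G) &= \prod_{c \in C} P(c \mid G)
  && \text{cascades treated as independent} \\
\hat{G} &= \arg\max_{|G| \le k} P(C \mid G)
  && \text{search over graphs with at most } k \text{ edges}
\end{align}
```

The first line is the sum that the Matrix Tree Theorem evaluates; the last line is the part that stays hard, since the maximization ranges over a super-exponential number of graphs.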

10 An Alternative Formulation
• We consider only the most likely tree
• Maximum log-likelihood for a cascade c under a graph G: [equation not captured in the transcript]
• Log-likelihood of G given a set of cascades C: [equation not captured in the transcript]
• The problem is NP-hard (MAX-k-COVER)
• Our algorithm can do it near-optimally in O(N^2)
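The two objective functions named on this slide are not in the transcript. A hedged reconstruction, consistent with the notation F_c(G) and F_C(G) used on the submodularity slide, is sketched below; \mathcal{T}_c(G) and the budget k are assumed notation, and the original definition may include an additive normalization that is omitted here.

```latex
\begin{align}
F_c(G) &= \max_{T \in \mathcal{T}_c(G)} \log P(c \mid T)
  && \text{keep only the most likely tree} \\
F_C(G) &= \sum_{c \in C} F_c(G)
  && \text{add up over all cascades} \\
\hat{G} &= \arg\max_{|G| \le k} F_C(G)
\end{align}
```

Replacing the sum over trees by a maximum is what makes the per-cascade objective decompose into independent choices of one in-edge per node.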

11 Max Directed Spanning Tree
• Given a cascade c, what is the most likely propagation tree? [equation not captured in the transcript]
• A maximum directed spanning tree (MDST):
  • Just need to compute the MDST of the subgraph of G induced by c
  • The subgraph of G induced by c doesn't have loops (it is a DAG)
  • For each node, just pick an in-edge of maximum weight [equation not captured in the transcript]
• Local greedy selection gives the optimal tree!
[Figure: example subgraph over nodes a, b, c, d and its maximum directed spanning tree]
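The local greedy selection can be sketched in a few lines. The function name `most_likely_tree`, the edge-set representation of G, and the weight function (log of an assumed exponential f with unit scale) are illustrative assumptions. Because every usable edge points forward in time, choosing the best in-edge per node can never create a cycle.

```python
def most_likely_tree(times, graph_edges, log_weight):
    """For each infected node pick the max-weight in-edge from an
    earlier-infected node. Returns a dict child -> chosen parent."""
    parent = {}
    for v, t_v in times.items():
        candidates = [
            (log_weight(times[u], t_v), u)
            for (u, w) in graph_edges
            if w == v and u in times and times[u] < t_v
        ]
        if candidates:
            parent[v] = max(candidates)[1]
    return parent

def log_weight(t_u, t_v):
    """Assumed edge weight: log of exp(-(t_v - t_u))."""
    return -(t_v - t_u)

# Cascade c: (a, 1), (c, 2), (b, 3), (e, 4) on a hypothetical graph G.
times = {"a": 1, "c": 2, "b": 3, "e": 4}
G = {("a", "c"), ("a", "b"), ("c", "b"), ("b", "e"), ("c", "e"), ("e", "a")}

print(most_likely_tree(times, G, log_weight))
```

Here the edge (e, a) is ignored because e was infected after a, so a is left as the root of the tree.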

12 Great News: Submodularity
• Theorem: the log-likelihood F_c(G) is monotonic and submodular in the edges of the graph G:
  • F_c(A ∪ {e}) − F_c(A) ≥ F_c(B ∪ {e}) − F_c(B), for A ⊆ B ⊆ V×V
  • Left side: gain of adding an edge to a "small" graph. Right side: gain of adding an edge to a "large" graph
• Proof:
  • Single cascade c, edge e into node s with weight x
  • Let w be the max-weight in-edge of s in A
  • Let w' be the max-weight in-edge of s in B
  • We know: w ≤ w'
  • Now: F_c(A ∪ {e}) − F_c(A) = max(w, x) − w ≥ max(w', x) − w' = F_c(B ∪ {e}) − F_c(B)
• Then the log-likelihood F_C(G) is monotonic and submodular too
[Figure: edge sets A ⊆ B with the in-edges of node s]
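The diminishing-returns inequality can be checked by brute force on a toy instance. In this sketch the objective is the sum over nodes of the best in-edge weight, with weights expressed as gains over an external-influence baseline of 0; the node names, the weights, and that baseline convention are all assumptions chosen for illustration.

```python
from itertools import chain, combinations

NODES = ("s", "t")
# Hypothetical edge weights (gain over the external-influence baseline).
WEIGHTS = {("a", "s"): 3.0, ("b", "s"): 5.0, ("c", "s"): 1.0, ("a", "t"): 2.0}

def F(edges):
    """Sum over nodes of the best in-edge weight (0 if no in-edge)."""
    total = 0.0
    for v in NODES:
        incoming = [WEIGHTS[(u, x)] for (u, x) in edges if x == v]
        total += max(incoming, default=0.0)
    return total

def powerset(items):
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))

def is_monotone_submodular():
    """Check F(A) <= F(B) and gain(A, e) >= gain(B, e) for all A ⊆ B, e ∉ B."""
    ground = list(WEIGHTS)
    for B in map(set, powerset(ground)):
        for A in map(set, powerset(sorted(B))):
            if F(A) > F(B):
                return False
            for e in ground:
                if e in B:
                    continue
                if F(A | {e}) - F(A) < F(B | {e}) - F(B):
                    return False
    return True

print(is_monotone_submodular())
```

The check mirrors the proof on the slide: adding an edge into s can only raise the current maximum, and it raises it by less when the existing maximum is already larger.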

13 Finding the Diffusion Graph
• Use greedy hill-climbing to maximize F_C(G):
  • At every step, pick the edge that maximizes the marginal improvement
[Figure: table of marginal gains per candidate edge over nodes a to e, with localized updates after each pick]
• Benefits:
  1. Approximation guarantee (≈ 0.63 of OPT)
  2. Tight on-line bounds on the solution quality
  3. Speed-ups:
     • Lazy evaluation (by submodularity)
     • Localized update (by the structure of the problem)
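Greedy hill-climbing with lazy evaluation can be sketched as follows. This is a generic lazy-greedy routine for a monotone submodular set function, written as an illustration rather than as the authors' implementation; the function name `lazy_greedy`, the toy objective, and its weights are assumptions. The localized-update speed-up from the slide is not modeled.

```python
import heapq

def lazy_greedy(candidates, F, k):
    """Pick up to k elements greedily, re-evaluating a marginal gain
    only when its cached value reaches the top of the queue."""
    selected, current = [], F(set())
    # Heap entries: (-gain, round in which gain was computed, element).
    heap = [(-(F({e}) - current), 0, e) for e in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, stamp, e = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date; by submodularity all other cached
            # gains are upper bounds, so e is the true best choice.
            if -neg_gain <= 0:
                break
            selected.append(e)
            current += -neg_gain
        else:
            gain = F(set(selected) | {e}) - current
            heapq.heappush(heap, (-gain, len(selected), e))
    return selected

# Toy objective: best in-edge weight per node (hypothetical weights).
WEIGHTS = {("a", "s"): 3.0, ("b", "s"): 5.0, ("c", "s"): 1.0, ("a", "t"): 2.0}

def F(edges):
    best = {}
    for (u, v) in edges:
        best[v] = max(best.get(v, 0.0), WEIGHTS[(u, v)])
    return sum(best.values())

print(lazy_greedy(WEIGHTS, F, k=2))
```

The 0.63 factor on the slide is the classical 1 − 1/e guarantee for greedy maximization of a monotone submodular function under a cardinality constraint; lazy evaluation changes only the running time, not the selected set.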

14 Experimental Setup
• We validate our method on:
  • Synthetic data:
    • Generate a graph G on k edges
    • Generate cascades
    • Record node infection times
    • Reconstruct G
  • Real data:
    • MemeTracker: 172m news articles
    • Aug '08 to Sept '09
    • 343m textual phrases (quotes)
• How many edges of G can we find?
  • Precision-recall
  • Break-even point
• How many cascades do we need?
• How fast are we?
• How well do we optimize F_c(G)?
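The precision-recall evaluation can be sketched as below, assuming the inferred edges come as a list ranked from strongest to weakest. The function names and the toy edges are hypothetical. The break-even point is the value at which precision equals recall, which happens when exactly as many edges are kept as the true graph contains (assuming the ranking is at least that long).

```python
def precision_recall_curve(ranked_edges, true_edges):
    """Precision and recall after keeping the top 1, 2, ... edges."""
    hits, curve = 0, []
    for kept, edge in enumerate(ranked_edges, start=1):
        hits += edge in true_edges
        curve.append((hits / kept, hits / len(true_edges)))
    return curve

def break_even_point(ranked_edges, true_edges):
    """Precision (= recall) when keeping as many edges as the truth has."""
    k = len(true_edges)
    top = ranked_edges[:k]
    return sum(edge in true_edges for edge in top) / k

true_edges = {("a", "b"), ("b", "c"), ("c", "d")}
ranked = [("a", "b"), ("x", "y"), ("b", "c"), ("c", "d")]

print(precision_recall_curve(ranked, true_edges))
print(break_even_point(ranked, true_edges))
```

A break-even point of 1.0 would mean the top-ranked edges are exactly the true edges.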

15 Small Synthetic Example
• Small synthetic network
• Pick k strongest edges
[Figure panels: true network, baseline network, our method]

16 Synthetic Networks
• Our performance does not depend on the network structure:
  • Synthetic networks: Forest Fire, Kronecker, etc.
  • Prob. of transmission: exponential, power law
• Break-even points of > 90% when the baseline gets [value missing in the transcript]%!
[Figure panels: 1024-node hierarchical Kronecker, exponential transmission model; 1000-node Forest Fire (α = 1.1), power-law transmission model]

17 How good is our graph?
• We achieve ≈ 90% of the best possible network!

18 How many cascades do we need?
• With 2x as many infections as edges, the break-even point is already [value missing in the transcript]!

19 Running Time
• Lazy evaluation and localized updates give a speed-up of 2 orders of magnitude!

20 Real Data
• MemeTracker dataset:
  • 172m news articles
  • Aug '08 to Sept '09
  • 343m textual phrases (quotes)
  • Times t_c(w) when site w mentions phrase (quote) c
• Given the times when sites mention phrases, we infer the network of information diffusion:
  • Who tends to copy (repeat after) whom

21 Real Network
• We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground-truth G
• From the MemeTracker dataset, we have the timestamps of:
  1. cascades of hyperlinks: sites link to other sites
  2. cascades of (MemeTracker) quotes: sites copy quotes from other sites
• Can we infer the hyperlink network from...
  • ...cascades of hyperlinks?
  • ...cascades of MemeTracker quotes?
• Are they correlated?
[Figure: two example cascades over nodes a, c, e, f]

22 Real Network
• Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!
[Figure panels: 500-node hyperlink network using hyperlink cascades; 500-node hyperlink network using MemeTracker cascades]

23 Diffusion Network
• 5,000 news sites
[Figure legend: blogs, mainstream media]

24 Diffusion Network (small part)
[Figure legend: blogs, mainstream media]

25 Networks and Processes
• We infer hidden networks based on diffusion data (timestamps)
• Problem formulation in a maximum likelihood framework:
  • NP-hard problem to solve exactly
  • We develop an approximation algorithm that:
    • is efficient: it runs in O(N^2)
    • is invariant to the structure of the underlying network
    • gives a sub-optimal network with a tight bound
• Future work:
  • Learn both the network and the diffusion model
  • Extensions to other processes taking place on networks
  • Applications to other domains: biology, neuroscience, etc.

26 Thanks! For more (Code & Data):