
1 FEATURE-ENHANCED PROBABILISTIC MODELS FOR DIFFUSION NETWORK INFERENCE
Stefano Ermon, ECML-PKDD, September 26, 2012. Joint work with Liaoruo Wang and John E. Hopcroft.

2 BACKGROUND
Diffusion processes are common in many types of networks. Cascading examples:
contact networks <> infections
friendship networks <> gossip
social networks <> products
academic networks <> ideas

3 BACKGROUND
Typically, the network structure is assumed to be known, which enables many interesting questions:
minimize spread (vaccinations)
maximize spread (viral marketing)
interdictions
What if the underlying network is unknown?

4 NETWORK INFERENCE
NetInf [Gomez-Rodriguez et al. 2010]
input: the actual number of edges in the latent network; observations of information cascades
output: the set of edges maximizing the likelihood of the observations (submodular objective)
NetRate [Gomez-Rodriguez et al. 2011]
input: observations of information cascades
output: the set of transmission rates maximizing the likelihood of the observations (convex optimization problem)
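NetInf's submodular objective admits the classic greedy approximation. As an illustration only, here is a minimal greedy edge selection with a toy coverage objective (the number of infection events that some chosen edge can explain), not NetInf's actual cascade likelihood; all names are hypothetical.

```python
# Greedy selection in the spirit of NetInf, which exploits submodularity of
# the cascade likelihood. The coverage objective below is a toy stand-in.
def greedy_edge_selection(candidate_edges, cascades, k):
    """Pick k edges, each time adding the edge with the largest marginal gain."""
    def coverage(edges):
        explained = 0
        for cascade in cascades:  # cascade: time-ordered list of (node, time)
            for i, (v, _) in enumerate(cascade):
                # an event is "explained" if a chosen edge points to v
                # from a node infected earlier in the same cascade
                if any((u, v) in edges for u, _ in cascade[:i]):
                    explained += 1
        return explained

    chosen = set()
    for _ in range(k):
        best = max(candidate_edges - chosen,
                   key=lambda e: coverage(chosen | {e}))
        chosen.add(best)
    return chosen
```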

5 CASCADES
Cascade 1: (v_0, t_0^1), (v_1, t_1^1), (v_2, t_2^1), (v_3, t_3^1), (v_4, t_4^1)
Cascade 2: (v_5, t_0^2), (v_6, t_1^2), (v_2, t_2^2), (v_7, t_3^2), (v_8, t_4^2)
Cascade 3: (v_9, t_0^3), (v_3, t_1^3), (v_7, t_2^3), (v_10, t_3^3), (v_11, t_4^3)
Given observations of a diffusion process, what can we infer about the underlying network?
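A cascade like the three above can be represented as a time-ordered list of (node, timestamp) pairs. The sketch below substitutes hypothetical integer timestamps for the t_i^c values and recovers the nodes that recur across cascades, the overlap that network inference exploits.

```python
from collections import Counter

# Hypothetical timestamps; only the ordering matters for this illustration.
cascades = [
    [("v0", 0), ("v1", 1), ("v2", 2), ("v3", 3), ("v4", 4)],
    [("v5", 0), ("v6", 1), ("v2", 2), ("v7", 3), ("v8", 4)],
    [("v9", 0), ("v3", 1), ("v7", 2), ("v10", 3), ("v11", 4)],
]

def nodes_in_multiple_cascades(cascades):
    """Nodes appearing in more than one cascade link the cascades together."""
    counts = Counter(v for cascade in cascades for v, _ in cascade)
    return {v for v, n in counts.items() if n > 1}
```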

6 MOTIVATING EXAMPLE
Information diffusion in the Twitter following network.

7 PREVIOUS WORK
Major assumptions:
the diffusion process is causal (not affected by events in the future)
the diffusion process is monotonic (a node can be infected at most once)
infection events closer in time are more likely to be causally related (e.g., exponential, Rayleigh, or power-law distribution)
Time-stamps alone are not sufficient:
most real-world diffusion processes are recurrent
cascades are often a mixture of (geographically) local sub-cascades, which cannot be told apart by looking only at time-stamps
many other factors are informative (e.g., language, pairwise similarity)
Our work generalizes previous models to take these factors into account.
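The time-based assumption is usually encoded with a parametric transmission likelihood. A minimal sketch of two of the named families, where dt is the delay between a candidate parent event and the infection, and alpha is a per-edge transmission rate:

```python
import math

# Standard parametric transmission densities: the likelihood that j infected k
# as a function of the time gap dt = t_k - t_j (zero for non-positive gaps,
# since the process is causal).
def exponential_pdf(dt, alpha):
    return alpha * math.exp(-alpha * dt) if dt > 0 else 0.0

def rayleigh_pdf(dt, alpha):
    return alpha * dt * math.exp(-alpha * dt * dt / 2) if dt > 0 else 0.0
```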

8 PROBLEM DEFINITION
Weighted, directed graph G = (V, E)
known: node set V
unknown: weighted edge set E
Observations: generalized cascades {π_1, π_2, …, π_M}. Example tweets:
957BenFM: #ladygaga always rocks…
2frog: #ladygaga bella canzone…
AbbeyResort: #followfriday see you all tonight…
figmentations: #followfriday cannot wait…
2frog: #followfriday 周五活动计划…

9 PROBLEM DEFINITION
Given:
a set of vertices V
a set of generalized cascades {π_1, π_2, …, π_M}
a generative probabilistic model (feature-enhanced)
Goal: find the most likely adjacency matrix of transmission rates A = {α_jk | j, k ∈ V, j ≠ k} for the latent network, given the observed cascades {π_1, π_2, …, π_M}.

10 FEATURE-ENHANCED MODEL
Handling multiple occurrences:
splitting: an infection event of a node is the result of all previous events up to its last infection (memoryless)
non-splitting: an infection event is the result of all previous events
In both variants, an event is independent of future infection events (causal process).
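The two treatments of recurrent infections can be sketched as follows; function names are illustrative, not from the paper's code. Under splitting, the candidate parents of an event reach back only to the node's own previous infection; under non-splitting they include every earlier event.

```python
def candidate_parents_splitting(cascade, i):
    """Events that may explain the i-th event under the splitting model:
    memoryless, so everything before the node's own last infection is dropped."""
    v, _ = cascade[i]
    start = 0
    for j in range(i):            # find v's most recent earlier infection
        if cascade[j][0] == v:
            start = j + 1
    return cascade[start:i]

def candidate_parents_nonsplitting(cascade, i):
    """Under the non-splitting model, all previous events count."""
    return cascade[:i]
```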

11 FEATURE-ENHANCED MODEL
Generalized cascades rest on two assumptions:
assumption 1: events closer in time are more likely to be causally related
assumption 2: events closer in feature space are more likely to be causally related
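Assumption 2 requires a similarity in feature space. The "+J" variants in the result tables suggest a Jaccard-style pairwise similarity; a minimal sketch, assuming tweets are reduced to token sets (an assumption, not a detail given on the slides):

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets, e.g. token sets of tweets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```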

12 GENERATIVE MODEL
Given a diffusion distribution (exponential, Rayleigh, etc.) and the observed cascades, the likelihood of an assumed network A combines:
enough edges so that every infection event can be explained (reward)
for every infected node and each of its neighbors, how long it takes for the neighbor to become infected (penalty)
why a node is not infected at all (penalty)
The distance between events determines the probability of their being causally related.
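A hedged sketch of this likelihood structure for the plain exponential model without features: each non-seed infection contributes a hazard term (reward) plus survival terms from candidate parents (penalty), and nodes never infected contribute survival penalties up to an observation horizon T. The function and its arguments are illustrative, and it assumes every non-seed event has at least one candidate parent with positive rate.

```python
import math

# A maps (j, k) -> alpha_jk; infection_times maps node -> time; the earliest
# event is treated as the cascade's seed. With exponential delays the hazard
# is alpha and the log-survival term is -alpha * dt.
def cascade_log_likelihood(A, infection_times, all_nodes, T):
    infected = sorted(infection_times, key=infection_times.get)
    ll = 0.0
    for k in infected[1:]:                      # the seed is not explained
        t_k = infection_times[k]
        hazard = 0.0
        for j in infected:
            t_j = infection_times[j]
            if t_j < t_k:
                a = A.get((j, k), 0.0)
                ll -= a * (t_k - t_j)           # edge (j,k) survived until t_k
                hazard += a
        ll += math.log(hazard)                  # some parent caused the event
    for k in all_nodes:
        if k not in infection_times:            # never infected: penalty
            for j in infected:
                ll -= A.get((j, k), 0.0) * (T - infection_times[j])
    return ll
```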

13 OPTIMIZATION FRAMEWORK
Given the diffusion distribution (exponential, Rayleigh, etc.), maximize L(π_1, π_2 | A) over the assumed network A. The objective is:
1. convex in A
2. decomposable
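Decomposability means each node's incoming rates can be fit independently. The slides use L-BFGS-B with box constraints; to keep this sketch dependency-free, plain projected gradient ascent is substituted for the one-parameter subproblem max over alpha >= 0 of log(alpha) - alpha * dt, whose optimum is alpha = 1/dt.

```python
# Projected gradient ascent stand-in for the box-constrained solver the paper
# actually uses (L-BFGS-B); the objective is one incoming-rate subproblem.
def fit_incoming_rate(dt, steps=5000, lr=1e-3, floor=1e-9):
    """MLE of a single incoming rate alpha for one observed delay dt:
    maximize log(alpha) - alpha * dt subject to alpha >= 0."""
    alpha = 1e-3
    for _ in range(steps):
        grad = 1.0 / alpha - dt        # d/dalpha of log(alpha) - alpha * dt
        alpha = max(floor, alpha + lr * grad)   # project onto the box
    return alpha
```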

14 EXPERIMENTAL SETUP
Dataset: Twitter (66,679 nodes; 240,637 directed edges)
Cascades: 500 hashtags; 103,148 tweets
Ground truth known
Feature models: language, pairwise similarity, and their combination

15 EXPERIMENTAL SETUP
Baselines:
NetInf (takes the true number of edges as input)
NetRate
Language detector:
the language is computed using an n-gram model
noisy estimates
Convex optimization:
limited-memory BFGS algorithm with box constraints (L-BFGS-B)
CVXOPT cannot handle the scale of our Twitter dataset
All algorithms are implemented in Python with the Fortran implementation of L-BFGS-B available in SciPy, and all experiments are performed on a machine running CentOS Linux with a 6-core Intel X5690 3.46 GHz CPU and 48 GB of memory.
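Character n-gram profiles are the standard way such a language detector works; the toy sketch below illustrates the idea (the actual detector and training data are not specified on the slide, and the profiles and texts here are hypothetical).

```python
from collections import Counter

def ngram_profile(text, n=2):
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def detect_language(text, profiles, n=2):
    """Pick the language whose profile shares the most n-gram mass with text."""
    target = ngram_profile(text, n)
    def overlap(profile):
        return sum(min(target[g], profile[g]) for g in target)
    return max(profiles, key=lambda lang: overlap(profiles[lang]))
```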

16 PERFORMANCE COMPARISON: Non-Splitting Exponential

Metric     NetInf   NetRate   MoNet   MoNet+L   MoNet+J   MoNet+LJ
Precision  0.362    0.592     0.434   0.464     0.524     0.533
Recall     0.362    0.069     0.307   0.374     0.450     0.483
F1-score   0.362    0.124     0.359   0.414     0.484     0.507
TP         518      99        439     535       644       692
FP         914      62        573     618       586       606
FN         914      1333      993     897       788       740

17 PERFORMANCE COMPARISON: Splitting Exponential

Metric     NetInf   NetRate   MoNet   MoNet+L   MoNet+J   MoNet+LJ
Precision  0.362    0.592     0.514   0.516     0.531     0.534
Recall     0.362    0.069     0.599   0.605     0.618     0.635
F1-score   0.362    0.124     0.554   0.557     0.571     0.581
TP         518      99        858     867       885       910
FP         914      62        810     812       781       793
FN         914      1333      574     565       547       522

18 PERFORMANCE COMPARISON: Non-Splitting Rayleigh

Metric     NetInf   NetRate   MoNet   MoNet+L   MoNet+J   MoNet+LJ
Precision  0.354    0.560     0.420   0.454     0.479     0.484
Recall     0.354    0.072     0.218   0.262     0.286     0.294
F1-score   0.354    0.127     0.287   0.332     0.358     0.366
TP         507      103       312     375       409       421
FP         925      81        430     451       445       449
FN         925      1329      1120    1057      1023      1011

19 PERFORMANCE COMPARISON: Splitting Rayleigh

Metric     NetInf   NetRate   MoNet   MoNet+L   MoNet+J   MoNet+LJ
Precision  0.354    0.560     0.480   0.493     0.495     0.499
Recall     0.354    0.072     0.562   0.566     0.570     0.572
F1-score   0.354    0.127     0.518   0.527     0.530     0.533
TP         507      103       805     811       816       819
FP         925      81        872     835       834       821
FN         925      1329      627     621       616       613

20 CONCLUSION
Feature-enhanced probabilistic models infer the latent network from observations of a diffusion process.
Primary approach: MoNet, with non-splitting and splitting variants to handle recurrent processes.
Our models consider not only the relative time differences between infection events, but also a richer set of features.
The inference problem still involves convex optimization, and it can be decomposed into smaller sub-problems that are solved efficiently in parallel.
Improved performance on Twitter.

21 THANK YOU!

