Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance.

Slides:



Advertisements
Similar presentations
Beyond Convexity – Submodularity in Machine Learning
Advertisements

Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance.
LEARNING INFLUENCE PROBABILITIES IN SOCIAL NETWORKS Amit Goyal Francesco Bonchi Laks V. S. Lakshmanan University of British Columbia Yahoo! Research University.
Modeling Blog Dynamics Speaker: Michaela Götz Joint work with: Jure Leskovec, Mary McGlohon, Christos Faloutsos Cornell University Carnegie Mellon University.
Minimizing Seed Set for Viral Marketing Cheng Long & Raymond Chi-Wing Wong Presented by: Cheng Long 20-August-2011.
Spread of Influence through a Social Network Adapted from :
Maximizing the Spread of Influence through a Social Network
Cost-effective Outbreak Detection in Networks Jure Leskovec Machine Learning Department, CMU Joint work with Andreas Krause, Carlos Guestrin, Christos.
Randomized Sensing in Adversarial Environments Andreas Krause Joint work with Daniel Golovin and Alex Roper International Joint Conference on Artificial.
DAVA: Distributing Vaccines over Networks under Prior Information
Maximizing the Spread of Influence through a Social Network
Suqi Cheng Research Center of Web Data Sciences & Engineering
Online Distributed Sensor Selection Daniel Golovin, Matthew Faulkner, Andreas Krause theory and practice collide 1.
Experiments We measured the times(s) and number of expanded nodes to previous heuristic using BFBnB. Dynamic Programming Intuition. All DAGs must have.
Variational Inference in Bayesian Submodular Models
Carnegie Mellon Selecting Observations against Adversarial Objectives Andreas Krause Brendan McMahan Carlos Guestrin Anupam Gupta TexPoint fonts used in.
Efficient Informative Sensing using Multiple Robots
A Decentralised Coordination Algorithm for Maximising Sensor Coverage in Large Sensor Networks Ruben Stranders, Alex Rogers and Nicholas R. Jennings School.
Near-optimal Nonmyopic Value of Information in Graphical Models Andreas Krause, Carlos Guestrin Computer Science Department Carnegie Mellon University.
Sensor placement applications Monitoring of spatial phenomena Temperature Precipitation... Active learning, Experiment design Precipitation data from Pacific.
Computational Methods for Management and Economics Carla Gomes
Non-myopic Informative Path Planning in Spatio-Temporal Models Alexandra Meliou Andreas Krause Carlos Guestrin Joe Hellerstein.
INFERRING NETWORKS OF DIFFUSION AND INFLUENCE Presented by Alicia Frame Paper by Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Kraus.
1 Dynamic Resource Allocation in Conservation Planning 1 Daniel GolovinAndreas Krause Beth Gardner Sarah Converse Steve Morey.
Near-optimal Observation Selection using Submodular Functions Andreas Krause joint work with Carlos Guestrin (CMU)
1 Efficient planning of informative paths for multiple robots Amarjeet Singh *, Andreas Krause +, Carlos Guestrin +, William J. Kaiser *, Maxim Batalin.
Simpath: An Efficient Algorithm for Influence Maximization under Linear Threshold Model Amit Goyal Wei Lu Laks V. S. Lakshmanan University of British Columbia.
Carnegie Mellon AI, Sensing, and Optimized Information Gathering: Trends and Directions Carlos Guestrin joint work with: and: Anupam Gupta, Jon Kleinberg,
Models of Influence in Online Social Networks
Personalized Influence Maximization on Social Networks
Dynamics of networks Jure Leskovec Computer Science Department
Jure Leskovec PhD: Machine Learning Department, CMU Now: Computer Science Department, Stanford University.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
Influence Maximization in Dynamic Social Networks Honglei Zhuang, Yihan Sun, Jie Tang, Jialin Zhang, Xiaoming Sun.
December 7-10, 2013, Dallas, Texas
Carnegie Mellon Maximizing Submodular Functions and Applications in Machine Learning Andreas Krause, Carlos Guestrin Carnegie Mellon University.
Approximate Dynamic Programming Methods for Resource Constrained Sensor Management John W. Fisher III, Jason L. Williams and Alan S. Willsky MIT CSAIL.
Inference Complexity As Learning Bias Daniel Lowd Dept. of Computer and Information Science University of Oregon Joint work with Pedro Domingos.
Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg, Eva Tardos Cornell University KDD 2003.
Randomized Composable Core-sets for Submodular Maximization Morteza Zadimoghaddam and Vahab Mirrokni Google Research New York.
Maximizing the Spread of Influence through a Social Network Authors: David Kempe, Jon Kleinberg, É va Tardos KDD 2003.
Jure Leskovec Computer Science Department Cornell University / Stanford University.
Manuel Gomez Rodriguez Bernhard Schölkopf I NFLUENCE M AXIMIZATION IN C ONTINUOUS T IME D IFFUSION N ETWORKS , ICML ‘12.
Algorithms For Solving History Sensitive Cascade in Diffusion Networks Research Proposal Georgi Smilyanov, Maksim Tsikhanovich Advisor Dr Yu Zhang Trinity.
Cost-effective Outbreak Detection in Networks Presented by Amlan Pradhan, Yining Zhou, Yingfei Xiang, Abhinav Rungta -Group 1.
Distributed Optimization Yen-Ling Kuo Der-Yeuan Yu May 27, 2010.
1 1 MPI for Intelligent Systems 2 Stanford University Manuel Gomez Rodriguez 1,2 Bernhard Schölkopf 1 S UBMODULAR I NFERENCE OF D IFFUSION NETWORKS FROM.
Controlling Propagation at Group Scale on Networks Yao Zhang*, Abhijin Adiga +, Anil Vullikanti + *, and B. Aditya Prakash* *Department of Computer Science.
1 1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez.
A Connectivity-Based Popularity Prediction Approach for Social Networks Huangmao Quan, Ana Milicic, Slobodan Vucetic, and Jie Wu Department of Computer.
Biao Wang 1, Ge Chen 1, Luoyi Fu 1, Li Song 1, Xinbing Wang 1, Xue Liu 2 1 Shanghai Jiao Tong University 2 McGill University
Yu Wang1, Gao Cong2, Guojie Song1, Kunqing Xie1
Inferring Networks of Diffusion and Influence
Independent Cascade Model and Linear Threshold Model
Greedy & Heuristic algorithms in Influence Maximization
Monitoring rivers and lakes [IJCAI ‘07]
Near-optimal Observation Selection using Submodular Functions
Who are the most influential bloggers?
ElasticTree Michael Fruchtman.
Optimizing Submodular Functions
Distributed Submodular Maximization in Massive Datasets
Jure Leskovec and Christos Faloutsos Machine Learning Department
Independent Cascade Model and Linear Threshold Model
The Importance of Communities for Learning to Influence
Effective Social Network Quarantine with Minimal Isolation Costs
Coverage Approximation Algorithms
Cost-effective Outbreak Detection in Networks
Viral Marketing over Social Networks
Independent Cascade Model and Linear Threshold Model
Presentation transcript:

Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance

Scenario 1: Water network  Given a real city water distribution network  And data on how contaminants spread in the network  Problem posed by US Environmental Protection Agency 2 S On which nodes should we place sensors to efficiently detect the all possible contaminations? S

Scenario 2: Cascades in blogs 3 Blogs Posts Time ordered hyperlinks Information cascade Which blogs should one read to detect cascades as effectively as possible?

General problem  Given a dynamic process spreading over the network  We want to select a set of nodes to detect the process effectively  Many other applications:  Epidemics  Influence propagation  Network security 4

Two parts to the problem  Reward, e.g.:  1) Minimize time to detection  2) Maximize number of detected propagations  3) Minimize number of infected people  Cost (location dependent):  Reading big blogs is more time consuming  Placing a sensor in a remote location is expensive 5

Problem setting  Given a graph G(V,E)  and a budget B for sensors  and data on how contaminations spread over the network:  for each contamination i we know the time T(i, u) when it contaminated node u  Select a subset of nodes A that maximize the expected reward subject to cost(A) < B 6 SS Reward for detecting contamination i

Overview  Problem definition  Properties of objective functions  Submodularity  Our solution  CELF algorithm  New bound  Experiments  Conclusion 7

Solving the problem  Solving the problem exactly is NP-hard  Our observation:  objective functions are submodular, i.e. diminishing returns 8 S1S1 S2S2 Placement A={S 1, S 2 } S’ New sensor: Adding S’ helps a lot S2S2 S4S4 S1S1 S3S3 Placement A={S 1, S 2, S 3, S 4 } S’ Adding S’ helps very little

Result 1: Objective functions are submodular  Objective functions from Battle of Water Sensor Networks competition [Ostfeld et al]:  1) Time to detection (DT)  How long does it take to detect a contamination?  2) Detection likelihood (DL)  How many contaminations do we detect?  3) Population affected (PA)  How many people drank contaminated water?  Our result: all are submodular 9

Background: Submodularity  Submodularity:  For all placement s it holds  Even optimizing submodular functions is NP-hard [Khuller et al] 10 Benefit of adding a sensor to a small placement Benefit of adding a sensor to a large placement

Background: Optimizing submodular functions  How well can we do?  A greedy is near optimal  at least 1-1/e (~63%) of optimal [Nemhauser et al ’78]  But  1) this only works for unit cost case (each sensor/location costs the same)  2) Greedy algorithm is slow  scales as O(|V|B) 11 a b c a b c d d reward e e Greedy algorithm

Result 2: Variable cost: CELF algorithm  For variable sensor cost greedy can fail arbitrarily badly  We develop a CELF (cost-effective lazy forward-selection) algorithm  a 2 pass greedy algorithm  Theorem: CELF is near optimal  CELF achieves ½(1-1/e ) factor approximation  CELF is much faster than standard greedy 12

Result 3: tighter bound  We develop a new algorithm-independent bound  in practice much tighter than the standard (1-1/e) bound  Details in the paper 13

Scaling up CELF algorithm  Submodularity guarantees that marginal benefits decrease with the solution size  Idea: exploit submodularity, doing lazy evaluations! (considered by Robertazzi et al for unit cost case) 14 d reward

Result 4: Scaling up CELF  CELF algorithm:  Keep an ordered list of marginal benefits b i from previous iteration  Re-evaluate b i only for top sensor  Re-sort and prune 15 a b c a b c d d reward e e

Result 4: Scaling up CELF  CELF algorithm:  Keep an ordered list of marginal benefits b i from previous iteration  Re-evaluate b i only for top sensor  Re-sort and prune 16 a a b c d dbc reward e e

Result 4: Scaling up CELF  CELF algorithm:  Keep an ordered list of marginal benefits b i from previous iteration  Re-evaluate b i only for top sensor  Re-sort and prune 17 a c a b c d d b reward e e

Overview  Problem definition  Properties of objective functions  Submodularity  Our solution  CELF algorithm  New bound  Experiments  Conclusion 18

Experiments: Questions  Q1: How close to optimal is CELF?  Q2: How tight is our bound?  Q3: Unit vs. variable cost  Q4: CELF vs. heuristic selection  Q5: Scalability 19

Experiments: 2 case studies  We have real propagation data  Blog network:  We crawled blogs for 1 year  We identified cascades – temporal propagation of information  Water distribution network:  Real city water distribution networks  Realistic simulator of water consumption provided by US Environmental Protection Agency 20

Case study 1: Cascades in blogs  We crawled 45,000 blogs for 1 year  We obtained 10 million posts  And identified 350,000 cascades 21

Q1: Blogs: Solution quality  Our bound is much tighter  13% instead of 37% 22 Old bound Our bound CELF

Q2: Blogs: Cost of a blog  Unit cost:  algorithm picks large popular blogs: instapundit.com, michellemalkin.com  Variable cost:  proportional to the number of posts  We can do much better when considering costs 23 Unit cost Variable cost

Q4: Blogs: Heuristics  CELF wins consistently 24

Q5: Blogs: Scalability  CELF runs 700 times faster than simple greedy algorithm 25

Case study 2: Water network  Real metropolitan area water network (largest network optimized):  V = 21,000 nodes  E = 25,000 pipes  3.6 million epidemic scenarios (152 GB of epidemic data)  By exploiting sparsity we fit it into main memory (16GB) 26

Q1: Water: Solution quality  Again our bound is much tighter 27 Old bound Our bound CELF

Q3: Water: Heuristic placement  Again, CELF consistently wins 28

Q5: Water: Scalability  CELF is 10 times faster than greedy 29

Results of BWSN competition Author #non- dominated (out of 30) CELF 26 Berry et. al. 21 Dorini et. al. 20 Wu and Walski 19 Ostfeld et al 14 Propato et. al. 12 Eliades et. al. 11 Huang et. al. 7 Guan et. al. 4 Ghimire et. al. 3 Trachtman 2 Gueli 2 Preis and Ostfeld 1 30  Battle of Water Sensor Networks competition  [Ostfeld et al]: count number of non-dominated solutions

Conclusion  General methodology for selecting nodes to detect outbreaks  Results:  Submodularity observation  Variable-cost algorithm with optimality guarantee  Tighter bound  Significant speed-up (700 times)  Evaluation on large real datasets (150GB)  CELF won consistently 31

Other results – see our poster  Many more details:  Fractional selection of the blogs  Generalization to future unseen cascades  Multi-criterion optimization  We show that triggering model of Kempe et al is a special case of out setting 32 Thank you! Questions?