Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance.

Slides:



Advertisements
Similar presentations
TWO STEP EQUATIONS 1. SOLVE FOR X 2. DO THE ADDITION STEP FIRST
Advertisements

Constraint Satisfaction Problems
1
1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.
© 2008 Pearson Addison Wesley. All rights reserved Chapter Seven Costs.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Chapter 1 The Study of Body Function Image PowerPoint
Cognitive Radio Communications and Networks: Principles and Practice By A. M. Wyglinski, M. Nekovee, Y. T. Hou (Elsevier, December 2009) 1 Chapter 12 Cross-Layer.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Author: Julia Richards and R. Scott Hawley
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
UNITED NATIONS Shipment Details Report – January 2006.
1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.
Properties of Real Numbers CommutativeAssociativeDistributive Identity + × Inverse + ×
Exit a Customer Chapter 8. Exit a Customer 8-2 Objectives Perform exit summary process consisting of the following steps: Review service records Close.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 10 second questions
1 Discreteness and the Welfare Cost of Labour Supply Tax Distortions Keshab Bhattarai University of Hull and John Whalley Universities of Warwick and Western.
Evaluating Window Joins over Unbounded Streams Author: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas University of Wisconsin-Madison CS Dept. Presenter:
1 Outline relationship among topics secrets LP with upper bounds by Simplex method basic feasible solution (BFS) by Simplex method for bounded variables.
Robust Window-based Multi-node Technology- Independent Logic Minimization Jeff L.Cobb Kanupriya Gulati Sunil P. Khatri Texas Instruments, Inc. Dept. of.
Solve Multi-step Equations
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Randomized Algorithms Randomized Algorithms CS648 1.
PP Test Review Sections 6-1 to 6-6
The Weighted Proportional Resource Allocation Milan Vojnović Microsoft Research Joint work with Thành Nguyen Microsoft Research Asia, Beijing, April, 2011.
EU market situation for eggs and poultry Management Committee 20 October 2011.
1 Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization Joint work with Andreas Krause 1 Daniel Golovin.
2 |SharePoint Saturday New York City
IP Multicast Information management 2 Groep T Leuven – Information department 2/14 Agenda •Why IP Multicast ? •Multicast fundamentals •Intradomain.
VOORBLAD.
Name Convolutional codes Tomashevich Victor. Name- 2 - Introduction Convolutional codes map information to code bits sequentially by convolving a sequence.
1 Dynamics of Real-world Networks Jure Leskovec Machine Learning Department Carnegie Mellon University
Submodularity for Distributed Sensing Problems Zeyn Saigol IR Lab, School of Computer Science University of Birmingham 6 th July 2010.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
© 2012 National Heart Foundation of Australia. Slide 2.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
Perfect Competition.
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
Equal or Not. Equal or Not
Analyzing Genes and Genomes
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Essential Cell Biology
Local Search Jim Little UBC CS 322 – CSP October 3, 2014 Textbook §4.8
Intracellular Compartments and Transport
PSSA Preparation.
Essential Cell Biology
Energy Generation in Mitochondria and Chlorplasts
Amit Goyal Laks V. S. Lakshmanan RecMax: Exploiting Recommender Systems for Fun and Profit University of British Columbia
1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.
Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance.
Cost-effective Outbreak Detection in Networks Jure Leskovec Machine Learning Department, CMU Joint work with Andreas Krause, Carlos Guestrin, Christos.
Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg, Eva Tardos Cornell University KDD 2003.
Cost-effective Outbreak Detection in Networks Presented by Amlan Pradhan, Yining Zhou, Yingfei Xiang, Abhinav Rungta -Group 1.
Monitoring rivers and lakes [IJCAI ‘07]
Near-optimal Observation Selection using Submodular Functions
Who are the most influential bloggers?
Cost-effective Outbreak Detection in Networks
Presentation transcript:

Cost-effective Outbreak Detection in Networks Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance

Scenario 1: Water network Given a real city water distribution network And data on how contaminants spread in the network Problem posed by US Environmental Protection Agency 2 S On which nodes should we place sensors to efficiently detect the all possible contaminations? S

Scenario 2: Cascades in blogs 3 Blogs Posts Time ordered hyperlinks Information cascade Which blogs should one read to detect cascades as effectively as possible?

General problem Given a dynamic process spreading over the network We want to select a set of nodes to detect the process effectively Many other applications: Epidemics Influence propagation Network security 4

Two parts to the problem Reward, e.g.: 1) Minimize time to detection 2) Maximize number of detected propagations 3) Minimize number of infected people Cost (location dependent): Reading big blogs is more time consuming Placing a sensor in a remote location is expensive 5

Problem setting Given a graph G(V,E) and a budget B for sensors and data on how contaminations spread over the network: for each contamination i we know the time T(i, u) when it contaminated node u Select a subset of nodes A that maximize the expected reward subject to cost(A) < B 6 SS Reward for detecting contamination i

Overview Problem definition Properties of objective functions Submodularity Our solution CELF algorithm New bound Experiments Conclusion 7

Solving the problem Solving the problem exactly is NP-hard Our observation: objective functions are submodular, i.e. diminishing returns 8 S1S1 S2S2 Placement A={S 1, S 2 } S New sensor: Adding S helps a lot S2S2 S4S4 S1S1 S3S3 Placement A={S 1, S 2, S 3, S 4 } S Adding S helps very little

Result 1: Objective functions are submodular Objective functions from Battle of Water Sensor Networks competition [Ostfeld et al]: 1) Time to detection (DT) How long does it take to detect a contamination? 2) Detection likelihood (DL) How many contaminations do we detect? 3) Population affected (PA) How many people drank contaminated water? Our result: all are submodular 9

Background: Submodularity Submodularity: For all placement s it holds Even optimizing submodular functions is NP-hard [Khuller et al] 10 Benefit of adding a sensor to a small placement Benefit of adding a sensor to a large placement

Background: Optimizing submodular functions How well can we do? A greedy is near optimal at least 1-1/e (~63%) of optimal [Nemhauser et al 78] But 1) this only works for unit cost case (each sensor/location costs the same) 2) Greedy algorithm is slow scales as O(|V|B) 11 a b c a b c d d reward e e Greedy algorithm

Result 2: Variable cost: CELF algorithm For variable sensor cost greedy can fail arbitrarily badly We develop a CELF (cost-effective lazy forward-selection) algorithm a 2 pass greedy algorithm Theorem: CELF is near optimal CELF achieves ½(1-1/e ) factor approximation CELF is much faster than standard greedy 12

Result 3: tighter bound We develop a new algorithm-independent bound in practice much tighter than the standard (1-1/e) bound Details in the paper 13

Scaling up CELF algorithm Submodularity guarantees that marginal benefits decrease with the solution size Idea: exploit submodularity, doing lazy evaluations! (considered by Robertazzi et al for unit cost case) 14 d reward

Result 4: Scaling up CELF CELF algorithm: Keep an ordered list of marginal benefits b i from previous iteration Re-evaluate b i only for top sensor Re-sort and prune 15 a b c a b c d d reward e e

Result 4: Scaling up CELF CELF algorithm: Keep an ordered list of marginal benefits b i from previous iteration Re-evaluate b i only for top sensor Re-sort and prune 16 a a b c d dbc reward e e

Result 4: Scaling up CELF CELF algorithm: Keep an ordered list of marginal benefits b i from previous iteration Re-evaluate b i only for top sensor Re-sort and prune 17 a c a b c d d b reward e e

Overview Problem definition Properties of objective functions Submodularity Our solution CELF algorithm New bound Experiments Conclusion 18

Experiments: Questions Q1: How close to optimal is CELF? Q2: How tight is our bound? Q3: Unit vs. variable cost Q4: CELF vs. heuristic selection Q5: Scalability 19

Experiments: 2 case studies We have real propagation data Blog network: We crawled blogs for 1 year We identified cascades – temporal propagation of information Water distribution network: Real city water distribution networks Realistic simulator of water consumption provided by US Environmental Protection Agency 20

Case study 1: Cascades in blogs We crawled 45,000 blogs for 1 year We obtained 10 million posts And identified 350,000 cascades 21

Q1: Blogs: Solution quality Our bound is much tighter 13% instead of 37% 22 Old bound Our bound CELF

Q2: Blogs: Cost of a blog Unit cost: algorithm picks large popular blogs: instapundit.com, michellemalkin.com Variable cost: proportional to the number of posts We can do much better when considering costs 23 Unit cost Variable cost

Q4: Blogs: Heuristics CELF wins consistently 24

Q5: Blogs: Scalability CELF runs 700 times faster than simple greedy algorithm 25

Case study 2: Water network Real metropolitan area water network (largest network optimized): V = 21,000 nodes E = 25,000 pipes 3.6 million epidemic scenarios (152 GB of epidemic data) By exploiting sparsity we fit it into main memory (16GB) 26

Q1: Water: Solution quality Again our bound is much tighter 27 Old bound Our bound CELF

Q3: Water: Heuristic placement Again, CELF consistently wins 28

Water: Placement visualization Different objective functions give different sensor placements 29 Population affected Detection likelihood

Q5: Water: Scalability CELF is 10 times faster than greedy 30

Results of BWSN competition Author #non- dominated (out of 30) CELF 26 Berry et. al. 21 Dorini et. al. 20 Wu and Walski 19 Ostfeld et al 14 Propato et. al. 12 Eliades et. al. 11 Huang et. al. 7 Guan et. al. 4 Ghimire et. al. 3 Trachtman 2 Gueli 2 Preis and Ostfeld 1 31 Battle of Water Sensor Networks competition [Ostfeld et al]: count number of non-dominated solutions

Conclusion General methodology for selecting nodes to detect outbreaks Results: Submodularity observation Variable-cost algorithm with optimality guarantee Tighter bound Significant speed-up (700 times) Evaluation on large real datasets (150GB) CELF won consistently 32

Other results – see our poster Many more details: Fractional selection of the blogs Generalization to future unseen cascades Multi-criterion optimization We show that triggering model of Kempe et al is a special case of out setting 33 Thank you! Questions?

Blogs: generalization 34

Blogs: Cost of a blog (2) But then algorithm picks lots of small blogs that participate in few cascades We pick best solution that interpolates between the costs We can get good solutions with few blogs and few posts 35 Each curve represents solutions with the same score