1 Carnegie Mellon Maximizing Submodular Functions and Applications in Machine Learning Andreas Krause, Carlos Guestrin Carnegie Mellon University

2 Tutorial Outline: 1. Introduction 2. Submodularity: properties and examples 3. Maximizing submodular functions 4. Applications in Machine Learning

3 Feature selection. Given random variables Y, X_1, …, X_n, we want to predict Y from a subset X_A = (X_{i_1}, …, X_{i_k}), i.e., find the k most informative features:
A* = argmax_A IG(X_A; Y) s.t. |A| ≤ k,
where IG(X_A; Y) = H(Y) − H(Y | X_A) is the uncertainty about Y before knowing X_A minus the uncertainty after knowing X_A. The problem is inherently combinatorial! Running example: a Naïve Bayes model with class Y = "Sick" and features X_1 = "Fever", X_2 = "Rash", X_3 = "Male".

4 Factoring distributions. Given random variables X_1, …, X_n, partition the variable set V into sets A and V \ A that are as independent as possible. Formally, we want
A* = argmin_A I(X_A; X_{V \ A}) s.t. 0 < |A| < n,
where I(X_A; X_B) = H(X_B) − H(X_B | X_A). This is a fundamental building block in structure learning [Narasimhan & Bilmes, UAI ’04]. The problem is inherently combinatorial!

5 Combinatorial problems in ML. Given a (finite) set V and a function F: 2^V → R, we want A* = argmin_A F(A) s.t. some constraints on A. How to solve such combinatorial problems? Mixed integer programming? Often difficult to scale to large problems. Relaxations (e.g., L1 regularization, etc.)? Not clear when they work. This talk: fully combinatorial algorithms (in the spirit of spanning tree, matching, …).

6 Example: Greedy algorithm for feature selection. Given: finite set V of features, utility function F(A) = IG(X_A; Y). Want: A* ⊆ V with |A*| ≤ k maximizing F — NP-hard! Greedy algorithm: start with A = ∅; for i = 1 to k: s* := argmax_s F(A ∪ {s}); A := A ∪ {s*}. How well can this simple heuristic do?
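A minimal Python sketch of this greedy heuristic (not from the original tutorial); F is assumed to be any black-box set-function evaluator, e.g. an information-gain estimate as defined below.

```python
def greedy_maximize(V, F, k):
    """Greedy subset selection: repeatedly add the element with the largest
    marginal gain F(A ∪ {s}) - F(A) until k elements are chosen."""
    A = set()
    for _ in range(k):
        candidates = set(V) - A
        if not candidates:
            break
        s_star = max(candidates, key=lambda s: F(A | {s}) - F(A))
        A.add(s_star)
    return A
```

With F(A) = IG(X_A; Y), this is exactly the feature-selection heuristic above.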

7 Key property: Diminishing returns. Consider selections A = {} and B = {X_2, X_3} in the Naïve Bayes model (Y = "Sick", X_1 = "Fever", X_2 = "Rash", X_3 = "Male"). Adding the new feature X_1 to the empty selection A helps a lot (large improvement); adding X_1 to B doesn’t help much (small improvement). Submodularity: for A ⊆ B and s ∉ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). Theorem [Krause, Guestrin UAI ’05]: The information gain F(A) in Naïve Bayes models is submodular!

8 Why is submodularity useful? Theorem [Nemhauser et al ’78]: The greedy maximization algorithm returns A_greedy with F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A), i.e., at least ~63% of the optimal value. The greedy algorithm gives a near-optimal solution! (More details and the exact statement later.) For information gain, this guarantee is the best possible unless P = NP! [Krause, Guestrin UAI ’05]

9 Carnegie Mellon Submodularity Properties and Examples

10 Set functions. Finite set V = {1, 2, …, n}; function F: 2^V → R. We will always assume F(∅) = 0 (w.l.o.g.), and that we have a black box that can evaluate F for any input A. Approximate (noisy) evaluation of F is ok (e.g., [37]). Example: F(A) = IG(X_A; Y) = H(Y) − H(Y | X_A) = Σ_{y, x_A} P(y, x_A) [log P(y | x_A) − log P(y)]. For instance, in the Naïve Bayes model with Y = "Sick", X_1 = "Fever", X_2 = "Rash", X_3 = "Male": F({X_1, X_2}) = 0.9, F({X_2, X_3}) = 0.5.
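As a concrete illustration of evaluating such a black-box F, here is a small brute-force computation of IG(X_A; Y) for a toy Naïve Bayes model; the CPT numbers are made up for illustration and are not the values (0.9, 0.5) quoted on the slide.

```python
import itertools
from math import log2

# Hypothetical Naive Bayes CPTs for Y="Sick" with features X1="Fever", X2="Rash", X3="Male"
p_y = {1: 0.1, 0: 0.9}                       # prior P(Y)
p_x_given_y = {                              # P(X_i = 1 | Y) for each feature
    "X1": {1: 0.8, 0: 0.1},
    "X2": {1: 0.6, 0: 0.2},
    "X3": {1: 0.5, 0: 0.5},                  # "Male" carries no information about "Sick"
}

def joint(y, x):
    """P(y, x) under the Naive Bayes factorization."""
    p = p_y[y]
    for name, val in x.items():
        q = p_x_given_y[name][y]
        p *= q if val == 1 else 1 - q
    return p

def info_gain(A):
    """F(A) = IG(X_A; Y) = sum_{y, x_A} P(y, x_A) [log P(y|x_A) - log P(y)]."""
    A = list(A)
    ig = 0.0
    for xs in itertools.product([0, 1], repeat=len(A)):
        x = dict(zip(A, xs))
        p_xA = sum(joint(y, x) for y in (0, 1))
        if p_xA == 0:
            continue
        for y in (0, 1):
            p_yxA = joint(y, x)
            if p_yxA > 0:
                ig += p_yxA * (log2(p_yxA / p_xA) - log2(p_y[y]))
    return ig

print(info_gain({"X1", "X2"}))   # informative features -> positive gain
print(info_gain({"X3"}))         # ~0: "Male" tells us nothing about "Sick"
```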

11 Submodular set functions. A set function F on V is called submodular if for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B). Equivalent diminishing-returns characterization: for all A ⊆ B and s ∉ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B) — adding s to the smaller set gives a large improvement, adding it to the larger set gives a small improvement.
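For small ground sets, the lattice definition can be checked by brute force; a sketch (exponential in |V|, so only useful as a sanity check):

```python
import math
from itertools import chain, combinations

def subsets(V):
    V = list(V)
    return chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))

def is_submodular(V, F):
    """Brute-force check of F(A) + F(B) >= F(A ∪ B) + F(A ∩ B) over all pairs."""
    subs = [frozenset(s) for s in subsets(V)]
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-12
               for A in subs for B in subs)

# Example: F(A) = sqrt(|A|) is submodular (a concave function of |A|, see slide 17)
print(is_submodular({1, 2, 3, 4}, lambda A: math.sqrt(len(A))))  # True
```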

12 Submodularity and supermodularity. A set function F on V is called submodular if (1) for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), or equivalently (2) for all A ⊆ B and s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). F is called supermodular if −F is submodular. F is called modular if F is both sub- and supermodular; for modular (“additive”) F, F(A) = Σ_{i ∈ A} w(i).

13 Example: Set cover. Place sensors in a building; V is the set of possible locations, and each sensor predicts values of positions within some radius. We want to cover the floorplan with discs. For A ⊆ V: F(A) = “area covered by sensors placed at A”.

14 Set cover is submodular. Take A = {S_1, S_2} and B = {S_1, S_2, S_3, S_4}, and a new sensor S'. The area newly covered by S' is at least as large for the smaller placement: F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B).
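A tiny coverage example in Python (the cells covered by each sensor are made-up numbers) showing the diminishing-returns inequality from the slide:

```python
# Hypothetical sensor coverage: each candidate location covers a set of floorplan cells;
# F(A) = number of cells covered by the sensors in A.
coverage = {
    "S1": {1, 2, 3},
    "S2": {3, 4},
    "S3": {4, 5, 6},
    "S4": {6, 7},
    "S'": {5, 8},
}

def F(A):
    covered = set()
    for s in A:
        covered |= coverage[s]
    return len(covered)

A = {"S1", "S2"}
B = {"S1", "S2", "S3", "S4"}
# Adding S' helps the smaller placement at least as much as the larger one
print(F(A | {"S'"}) - F(A), ">=", F(B | {"S'"}) - F(B))   # 2 >= 1
```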

15 Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD ’03]. Who should get free cell phones? V = {Dr. Zheng, Dr. Han, Khuong, Huy, Thanh, Nam}; each edge is labeled with the probability of influencing a neighbor (values such as 0.2–0.5 in the slide figure). F(A) = expected number of people influenced when targeting A.

16 Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD ’03]. Key idea: flip the coins c for all edges in advance; this yields a set of “live” edges. F_c(A) = number of people influenced under outcome c = number of nodes reachable from A via live edges — a set cover function! Hence F(A) = Σ_c P(c) F_c(A) is a nonnegative weighted sum of submodular functions, and is submodular as well.
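A sketch of this construction under the independent cascade model: sample “live” edges, count the nodes reachable from A (a coverage-style function), and average. The graph and probabilities below are illustrative, not the actual values from the slide figure.

```python
import random

# Hypothetical directed influence graph: edge (u, v) with activation probability p
edges = {("Zheng", "Khuong"): 0.5, ("Khuong", "Huy"): 0.3, ("Han", "Thanh"): 0.5,
         ("Thanh", "Nam"): 0.4, ("Huy", "Thanh"): 0.2, ("Han", "Khuong"): 0.5}

def reachable(A, live_edges):
    """F_c(A): nodes reachable from A via live edges under one coin-flip outcome c."""
    seen, stack = set(A), list(A)
    while stack:
        u = stack.pop()
        for (a, b) in live_edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return len(seen)

def influence(A, n_samples=1000, seed=0):
    """F(A) = E_c[F_c(A)], estimated by sampling live-edge outcomes.
    An average of submodular (coverage) functions, hence submodular."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        live = [e for e, p in edges.items() if rng.random() < p]
        total += reachable(A, live)
    return total / n_samples

print(influence({"Zheng", "Han"}))
```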

17 Submodularity and concavity. Suppose g: ℕ → R and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave! E.g., g could say “buying in bulk is cheaper”.

18 Maximum of submodular functions. Suppose F_1(A) and F_2(A) are submodular. Is F(A) = max(F_1(A), F_2(A)) submodular? No — max(F_1, F_2) is not submodular in general! (In the slide figure, F_1 and F_2 are concave functions of |A| whose pointwise maximum fails diminishing returns.)

19 Minimum of submodular functions. Well, maybe F(A) = min(F_1(A), F_2(A)) instead? Counterexample on V = {a, b}:

A        F_1(A)  F_2(A)  F(A) = min
∅          0       0       0
{a}        1       0       0
{b}        0       1       0
{a,b}      1       1       1

Here F({b}) − F(∅) = 0 < 1 = F({a,b}) − F({a}), violating diminishing returns. So min(F_1, F_2) is not submodular in general!

20 Carnegie Mellon Maximizing submodular functions

21 Maximizing submodular functions. Minimizing convex functions: polynomial time solvable! Minimizing submodular functions: polynomial time solvable! Maximizing convex functions: NP-hard! Maximizing submodular functions: NP-hard! But we can get approximation guarantees.

22 Maximizing non-monotonic functions. Suppose we want, for a non-monotonic F, A* = argmax_A F(A) s.t. A ⊆ V. Example: F'(A) = F(A) − C(A), where F(A) is a submodular utility and C(A) is a linear cost function; e.g., trading off utility and privacy in personalized search [Krause & Horvitz AAAI ’08]. In general this is NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1−ε}) approximation).

23 Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS ’07]. Theorem: There is an efficient randomized local-search procedure that, given a positive submodular function F with F(∅) = 0, returns a set A_LS such that F(A_LS) ≥ (2/5) max_A F(A). Also: picking a uniformly random set already gives a 1/4 approximation, and we cannot get better than a 3/4 approximation unless P = NP.

24 Scalarization vs. constrained maximization. Given a monotonic utility F(A) and cost C(A), we can optimize either: Option 1 (“scalarization”): max_A F(A) − C(A) s.t. A ⊆ V — can get a 2/5 approximation if F(A) − C(A) ≥ 0 for all A ⊆ V, but positiveness is a strong requirement. Option 2 (“constrained maximization”): max_A F(A) s.t. C(A) ≤ B — coming up next.

25 Constrained maximization: Outline. Problem: max_A F(A) s.t. C(A) ≤ B, where A is the selected set, F is a monotonic submodular utility, C is the selection cost, and B is the budget. We cover: subset selection (C(A) = |A|), complex constraints, and robust optimization.

26 Monotonicity. A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B). Examples: Influence in social networks [Kempe et al KDD ’03]. For discrete RVs, the entropy F(A) = H(X_A) is monotonic: suppose B = A ∪ C; then F(B) = H(X_A, X_C) = H(X_A) + H(X_C | X_A) ≥ H(X_A) = F(A). Information gain: F(A) = H(Y) − H(Y | X_A). Set cover. Matroid rank functions (dimension of vector spaces, …). …

27 Subset selection. Given: finite set V, monotonic submodular function F with F(∅) = 0. Want: A* = argmax_{|A| ≤ k} F(A). NP-hard!

28 Exact maximization of monotonic submodular functions. 1) Mixed integer programming [Nemhauser et al ’81]:
max η
s.t. η ≤ F(B) + Σ_{s ∈ V \ B} α_s δ_s(B) for all B ⊆ V,
     Σ_s α_s ≤ k, α_s ∈ {0, 1},
where δ_s(B) = F(B ∪ {s}) − F(B); solved using constraint generation.
2) Branch-and-bound: the “data-correcting algorithm” [Goldengorin et al ’99].
Both algorithms are worst-case exponential!

29 Approximate maximization. Given: finite set V, monotonic submodular function F(A). Want: A* ⊆ V with |A*| ≤ k maximizing F — NP-hard! Greedy algorithm: start with A_0 = ∅; for i = 1 to k: s_i := argmax_s F(A_{i−1} ∪ {s}) − F(A_{i−1}); A_i := A_{i−1} ∪ {s_i}.

30 Performance of greedy algorithm. Theorem [Nemhauser et al ’78]: Given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns A_greedy with F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A) — about 63% of optimal.

31 Example: Submodularity of info-gain. Let Y_1, …, Y_m, X_1, …, X_n be discrete RVs and F(A) = IG(Y; X_A) = H(Y) − H(Y | X_A). F(A) is always monotonic; however, it is NOT always submodular. Theorem [Krause & Guestrin UAI ’05]: If the X_i are all conditionally independent given Y, then F(A) is submodular! Hence the greedy algorithm works — and in fact, NO algorithm can do better than the (1 − 1/e) approximation!

32 Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST ’07]. People sit a lot: activity recognition in assistive technologies, seating pressure as a user interface. Prior work [Tan et al]: a chair equipped with 1 sensor per cm² achieves 82% accuracy on 10 postures (lean forward, slouch, lean left, …) — but costs $16,000! Can we get similar accuracy with fewer, cheaper sensors?

33 How to place sensors on a chair? Model the sensor readings at locations V as random variables and predict the posture Y using a probabilistic model P(Y, V). Pick sensor locations A* ⊆ V to minimize the entropy of the posture given the readings. We placed the sensors and ran a user study:

          Accuracy   Cost
Before      82%      $16,000
After       79%      $100

Similar accuracy at <1% of the cost!

34 Monitoring water networks [Krause et al, J Wat Res Mgt 2008]. Contamination of drinking water could affect millions of people. Place sensors (e.g., Hach sensors, ~$14K each) to detect contaminations; a simulator from the EPA models how contamination spreads through the network. “Battle of the Water Sensor Networks” competition: where should we place sensors to quickly detect contamination?

35 Model-based sensing. The utility of placing sensors is based on a model of the world; for water networks, the EPA's water flow simulator. V is the set of all network junctions, and F(A) = expected impact reduction from placing sensors at A: the model predicts high-, medium-, and low-impact contamination locations, and a sensor reduces impact through early detection. A placement that covers high-impact events has high impact reduction (e.g., F(A) = 0.9); one that misses them has low impact reduction (e.g., F(A) = 0.01). Theorem [Krause et al., J Wat Res Mgt ’08]: Impact reduction F(A) in water networks is submodular!

36 36 Battle of the Water Sensor Networks Competition Real metropolitan area network (12,527 nodes) Water flow simulator provided by EPA 3.6 million contamination events Multiple objectives: Detection time, affected population, … Place sensors that detect well “on average”

37 Bounds on the optimal solution [Krause et al., J Wat Res Mgt ’08]. (Plot, water networks data: population protected F(A), higher is better, vs. number of sensors placed; greedy solution vs. the offline Nemhauser bound.) The (1 − 1/e) bound is quite loose… can we get better bounds?

38 Data-dependent bounds [Minoux ’78]. Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s_1, …, s_k} be an optimal solution. Then
F(A*) ≤ F(A ∪ A*)
      = F(A) + Σ_i [F(A ∪ {s_1, …, s_i}) − F(A ∪ {s_1, …, s_{i−1}})]
      ≤ F(A) + Σ_i [F(A ∪ {s_i}) − F(A)]
      = F(A) + Σ_i δ_{s_i}.
For each s ∈ V \ A, let δ_s = F(A ∪ {s}) − F(A), and order the elements so that δ_1 ≥ δ_2 ≥ … ≥ δ_n. Then F(A*) ≤ F(A) + Σ_{i=1}^{k} δ_i.
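This bound is cheap to compute for any candidate solution A; a sketch:

```python
def data_dependent_bound(V, F, A, k):
    """Minoux-style online bound: F(A*) <= F(A) + sum of the k largest
    marginal gains delta_s = F(A ∪ {s}) - F(A) over s not in A."""
    A = set(A)
    deltas = sorted((F(A | {s}) - F(A) for s in set(V) - A), reverse=True)
    return F(A) + sum(deltas[:k])
```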

39 Bounds on the optimal solution [Krause et al., J Wat Res Mgt ’08]. Submodularity gives data-dependent bounds on the performance of any algorithm. (Plot, water networks data: sensing quality F(A), higher is better, vs. number of sensors placed; greedy solution vs. the data-dependent bound, which is much tighter than the offline Nemhauser bound.)

40 BWSN Competition results [Ostfeld et al., J Wat Res Mgt 2008]. 13 participants; performance measured on 30 different criteria (total score, higher is better). Participants included Krause et al., Berry et al., Dorini et al., Wu & Walski, Ostfeld & Salomons, Propato & Piller, Eliades & Polycarpou, Huang et al., Guan et al., Ghimire & Barkdoll, Trachtman, Gueli, and Preis & Ostfeld, using genetic algorithms (G), other heuristics (H), domain knowledge (D), and “exact” MIP methods (E). 24% better performance than the runner-up!

41 “Lazy” greedy algorithm [Minoux ’78]. Lazy greedy algorithm: run the first iteration as usual; keep an ordered list of the marginal benefits δ_s(A) from the previous iteration; re-evaluate δ_s only for the top element; if it stays on top, use it, otherwise re-sort. This is valid because, by submodularity, marginal benefits can only shrink, so stale values are upper bounds. Note: this also makes it very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. ’07]
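A sketch of lazy greedy with a priority queue (heap); entries are stamped with the iteration at which their gain was last computed, and an element is only accepted once its gain is fresh for the current iteration.

```python
import heapq

def lazy_greedy(V, F, k):
    """Lazy greedy [Minoux '78]: keep stale marginal gains in a max-heap and
    only re-evaluate the top element; valid because gains can only shrink."""
    A, fA = set(), F(set())
    # heap of (-gain, element, iteration at which the gain was computed)
    heap = [(-(F({s}) - fA), s, 0) for s in V]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, s, stamp = heapq.heappop(heap)
            if stamp == it:            # gain is fresh for this iteration: take it
                A.add(s)
                fA += -neg_gain
                break
            gain = F(A | {s}) - fA     # stale: re-evaluate and push back
            heapq.heappush(heap, (-gain, s, it))
    return A
```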

42 Submodularity to the rescue: results of lazy evaluation. We simulated all 3.6M contaminations over 2 weeks on 40 processors, producing 152 GB of data on disk (16 GB in main memory, compressed) — a very accurate but very slow evaluation of F(A): 30 hours per 20-sensor placement, i.e. 6 weeks for all 30 settings. Using “lazy evaluations”: 1 hour per 20 sensors — done after 2 days! (Plot: running time in minutes vs. number of sensors selected, for exhaustive search over all subsets, naïve greedy, and fast/lazy greedy; lower is better.)

43 What about the worst case? [Krause et al., NIPS ’07] A placement that detects well on “average-case” (accidental) contaminations may fail badly when, knowing the sensor locations, an adversary contaminates elsewhere: two placements can have very different average-case impact yet the same worst-case impact. Where should we place sensors to quickly detect contamination in the worst case?

44 Constrained maximization: Outline. Problem: max_A F(A) s.t. C(A) ≤ B (selected set A, utility function F, selection cost C, budget B). Covered so far: subset selection. Next: robust optimization, then complex constraints.

45 Optimizing for the worst case. Use a separate utility function F_i for each contamination scenario i: F_i(A) = impact reduction by sensors A for contamination i. We want to solve max_{|A| ≤ k} min_i F_i(A). Each F_i is submodular; unfortunately, min_i F_i is not submodular! For instance, a placement A may score high for contamination at node s (F_s(A) high) but low for contamination at node r (F_r(A) low), while another placement B scores high on both. How can we solve this robust optimization problem?

46 How does the greedy algorithm do? Theorem [NIPS ’07]: The problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P = NP. Example with ground set V = {a, b, c} (shown as icons on the slide) and budget k = 2:

Set A     F_1   F_2   min_i F_i
{a}        1     0      0
{b}        0     2      0
{c}        ε     ε      ε
{a, b}     1     2      1

Greedy picks c first (the only single element with a nonzero minimum), and can then add only a or b, so the greedy score is ε while the optimal solution {a, b} scores 1: greedy does arbitrarily badly. So we can’t find any approximation algorithm. Or can we?

47 Alternative formulation. If somebody told us the optimal value c* = max_{|A| ≤ k} min_i F_i(A), could we recover an optimal solution A*? We would need to find the smallest set A with min_i F_i(A) ≥ c*. Is this any easier? Yes, if we relax the constraint |A| ≤ k.

48 Solving the alternative problem. Trick: for each F_i and threshold c, define the truncation F'_{i,c}(A) = min{F_i(A), c} — truncation remains submodular! Problem 1 (last slide): maximize min_i F_i(A) — non-submodular, and we don’t know how to solve it. Problem 2: find a small set A with F'_{i,c}(A) = c for all i, i.e. with average truncated value (1/m) Σ_i F'_{i,c}(A) = c — submodular, but the requirement appears as a constraint. Both problems have the same optimal solutions, so solving one solves the other.

49 Maximization vs. coverage. Previously we wanted A* = argmax F(A) s.t. |A| ≤ k. Now we need to solve A* = argmin |A| s.t. F(A) ≥ Q. Greedy algorithm: start with A := ∅; while F(A) < Q and |A| < n: s* := argmax_s F(A ∪ {s}); A := A ∪ {s*}. Theorem [Wolsey et al]: Greedy returns A_greedy with |A_greedy| ≤ (1 + log max_s F({s})) |A_opt|. For the bound, assume F is integral; if not, just round it.
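A sketch of this coverage greedy (reused by the SATURATE sketch below):

```python
def greedy_cover(V, F, Q):
    """Greedy for the coverage problem: minimize |A| s.t. F(A) >= Q.
    For integral F, |A| is within a 1 + log(max_s F({s})) factor of optimal."""
    A = set()
    while F(A) < Q and len(A) < len(V):
        s_best = max(set(V) - A, key=lambda s: F(A | {s}))
        A.add(s_best)
    return A
```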

50 Solving the alternative problem (continued). With the truncation F'_{i,c}(A) = min{F_i(A), c}, Problem 1 (non-submodular — we don’t know how to solve it) turns into Problem 2, which is submodular, so we can use the greedy coverage algorithm!

51 SATURATE Algorithm [Krause et al, NIPS ’07]. Given: set V, integer k, and monotonic submodular functions F_1, …, F_m. Initialize c_min = 0, c_max = min_i F_i(V). Do binary search: c = (c_min + c_max)/2; greedily find A_G such that F'_{avg,c}(A_G) = c; if |A_G| ≤ α·k: increase c_min; if |A_G| > α·k: decrease c_max; repeat until convergence. (In the slide figure, color encodes the truncation threshold.)
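A sketch of SATURATE, reusing the greedy_cover routine above; alpha is the slack factor from the guarantee on the next slide (alpha = 1 gives a simple heuristic variant), and the details are a sketch rather than the authors' reference implementation.

```python
def saturate(V, Fs, k, alpha=1.0, eps=1e-3):
    """Sketch of SATURATE [Krause et al., NIPS '07]: binary search on the target
    value c, each time greedily covering the truncated average
    F'_avg,c(A) = (1/m) * sum_i min(F_i(A), c)."""
    m = len(Fs)
    c_min, c_max = 0.0, min(F(set(V)) for F in Fs)
    best = set(V)
    while c_max - c_min > eps:
        c = (c_min + c_max) / 2
        F_avg = lambda A: sum(min(F(A), c) for F in Fs) / m
        A_G = greedy_cover(V, F_avg, c)            # reuse the coverage greedy above
        if F_avg(A_G) >= c - 1e-9 and len(A_G) <= alpha * k:
            c_min, best = c, A_G                   # feasible: try a larger target
        else:
            c_max = c                              # infeasible: lower the target
    return best
```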

52 Theoretical guarantees [Krause et al, NIPS ’07]. Recall: the problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P = NP. Theorem: SATURATE finds a solution A_S such that min_i F_i(A_S) ≥ OPT_k and |A_S| ≤ α·k, where OPT_k = max_{|A| ≤ k} min_i F_i(A) and α = 1 + log max_s Σ_i F_i({s}). Theorem: If there were a polytime algorithm with a better factor β < α, then NP ⊆ DTIME(n^{log log n}).

53 Constrained maximization: Outline. Covered so far: subset selection (utility function, selection cost, budget) and robust optimization. Next: complex constraints.

54 Other aspects: Complex constraints. We want max_A F(A) or max_A min_i F_i(A) subject to constraints on A. So far: |A| ≤ k. In practice, constraints are more complex: different costs, C(A) ≤ B; locations need to be connected by paths [Chekuri & Pal, FOCS ’05] [Singh et al, IJCAI ’07] (e.g., lake monitoring); sensors need to communicate, i.e. form a routing tree (e.g., building monitoring).

55 Non-constant cost functions. For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …). The cost of a set is C(A) = Σ_{s ∈ A} c(s) — a modular function! Want to solve A* = argmax F(A) s.t. C(A) ≤ B. Cost-benefit greedy algorithm: start with A := ∅; while there is an s ∈ V \ A with C(A ∪ {s}) ≤ B: s* := argmax over such s of [F(A ∪ {s}) − F(A)] / c(s); A := A ∪ {s*}.

56 Performance of cost-benefit greedy. Want max_A F(A) s.t. C(A) ≤ 1, with:

Set A   F(A)   C(A)
{a}      2ε     ε
{b}      1      1

Cost-benefit greedy picks a (benefit/cost ratio 2 vs. 1). Then it cannot afford b, so it ends with value 2ε instead of 1: cost-benefit greedy performs arbitrarily badly!

57 Cost-benefit optimization [Wolsey ’82, Sviridenko ’04, Leskovec et al ’07]. Theorem [Leskovec et al. KDD ’07]: Let A_CB be the cost-benefit greedy solution and A_UC the unit-cost greedy solution (i.e., ignoring costs). Then max{F(A_CB), F(A_UC)} ≥ ½ (1 − 1/e) OPT. We can still compute online bounds and speed things up using lazy evaluations. Note: one can also get a (1 − 1/e) approximation in time O(n^4) [Sviridenko ’04], and slightly better than ½ (1 − 1/e) in O(n^2) [Wolsey ’82].
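A sketch of the two greedy variants and the max-of-both rule; the budget B and the per-element costs are assumed to be given as a number and a dict.

```python
def greedy_budget(V, F, cost, B, use_ratio):
    """Greedy under a budget: pick the feasible element with the largest
    marginal gain (use_ratio=False) or gain-per-cost ratio (use_ratio=True)."""
    A, spent = set(), 0.0
    while True:
        feasible = [s for s in set(V) - A if spent + cost[s] <= B]
        if not feasible:
            return A
        key = (lambda s: (F(A | {s}) - F(A)) / cost[s]) if use_ratio \
              else (lambda s: F(A | {s}) - F(A))
        s_star = max(feasible, key=key)
        A.add(s_star)
        spent += cost[s_star]

def budgeted_greedy(V, F, cost, B):
    """Return the better of cost-benefit and unit-cost greedy; by
    [Leskovec et al. KDD '07] this is within (1/2)(1 - 1/e) of optimal."""
    A_cb = greedy_budget(V, F, cost, B, use_ratio=True)
    A_uc = greedy_budget(V, F, cost, B, use_ratio=False)
    return A_cb if F(A_cb) >= F(A_uc) else A_uc
```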

58 Example: Cascades in the blogosphere [Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance ’07]. An information cascade spreads through the blogosphere over time; blogs further down the cascade learn about the story after us. Which blogs should we read to learn about big cascades early?

59 Water vs. Web: placing sensors in water networks vs. selecting informative blogs. In both problems we are given a graph with nodes (junctions / blogs) and edges (pipes / links), and cascades spreading dynamically over the graph (contamination / citations). We want to pick nodes to detect big cascades early. In both applications, the utility functions are submodular [generalizes Kempe et al, KDD ’03].

60 Performance on blog selection (~45k blogs). Cascades captured (higher is better) vs. number of blogs selected: greedy outperforms state-of-the-art heuristics (in-links, all out-links, # posts, random). Running time in seconds (lower is better) vs. number of blogs selected: exhaustive search (all subsets), naïve greedy, fast greedy — a 700× speedup using submodularity!

61 Predicting the “hot” blogs. We want blogs that will be informative in the future, so we split the data set: select greedily on historic data, test on future data. The greedy-on-historic selection detects well on the training period but poorly on the test period — poor generalization! The “cheating” curve (greedy on future, tested on future) shows what would be achievable. Why is that? Blog selection “overfits” to the training data; let’s see what goes wrong. We want blogs that continue to do well! (Plot: cascades captured vs. cost(A) = number of posts/day.)

62 Robust optimization. Idea: split the data into time intervals and let F_i(A) = detections in interval i; then optimize the worst case min_i F_i(A) using SATURATE. An “overfit” blog selection A may score, e.g., F_1(A) = .5, F_2(A) = .8, F_3(A) = .6 on some intervals but only F_4(A) = .01, F_5(A) = .02 on others, whereas a “robust” selection A* does well on all intervals. Robust optimization acts like regularization!

63 Predicting the “hot” blogs (continued). The robust solution, tested on the future, achieves 50% better generalization than greedy on historic data, compared against the “cheating” bound (greedy on future, tested on future). (Plot: cascades captured vs. cost(A) = number of posts/day.)

