Learning Submodular Functions
Nick Harvey, University of Waterloo
Joint work with Nina Balcan, Georgia Tech
Submodular functions
V = {1, 2, …, n}, f : 2^V → ℝ
Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
Equivalent (decreasing marginal values): f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T) ∀ S ⊆ T ⊆ V, x ∉ T
Examples:
– Concave functions: let h : ℝ → ℝ be concave. For each S ⊆ V, let f(S) = h(|S|).
– Vector spaces: let V = {v_1, …, v_n}, each v_i ∈ ℝ^n. For each S ⊆ V, let f(S) = rank(V[S]).
Submodular functions
V = {1, 2, …, n}, f : 2^V → ℝ
Non-negative: f(S) ≥ 0 ∀ S ⊆ V
Monotone: f(S) ≤ f(T) ∀ S ⊆ T
Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
Equivalent (decreasing marginal values): f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T) ∀ S ⊆ T ⊆ V, x ∉ T
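
The two equivalent definitions can be checked by brute force on a small ground set. The following is a minimal sketch, not part of the talk: the helper names (`subsets`, `is_submodular`, `has_decreasing_marginals`) and the two test functions are illustrative assumptions, and the exhaustive loops are only feasible for very small n.

```python
from itertools import combinations
import math

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground):
    """Check f(S) + f(T) >= f(S & T) + f(S | T) for all S, T."""
    all_sets = list(subsets(ground))
    return all(f(S) + f(T) >= f(S & T) + f(S | T) - 1e-9
               for S in all_sets for T in all_sets)

def has_decreasing_marginals(f, ground):
    """Check f(S+x) - f(S) >= f(T+x) - f(T) for all S <= T, x not in T."""
    for T in subsets(ground):
        for S in subsets(T):
            for x in ground - T:
                if f(S | {x}) - f(S) < f(T | {x}) - f(T) - 1e-9:
                    return False
    return True

V = frozenset(range(5))
concave_of_size = lambda S: math.sqrt(len(S))  # f(S) = h(|S|) with h concave
convex_of_size = lambda S: len(S) ** 2         # h convex, so not submodular
```

On this toy instance both checks accept the concave example and reject the convex one, which is what the equivalence predicts.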
Submodular functions
Strong connection between optimization and submodularity, e.g.: minimization [C’85, GLS’87, IFF’01, S’00, …], maximization [NWF’78, V’07, …]
Much recent interest in the Machine Learning community
– Tutorials at major conferences: ICML, NIPS, etc.
– … is a Machine Learning site
Algorithmic game theory
– Submodular utility functions
Interesting to understand their learnability
Exact Learning with value queries
Goemans, Harvey, Iwata, Mirrokni, SODA 2009
[Diagram: the algorithm sends a query x_1 to f : {0,1}^n → ℝ, receives f(x_1), and outputs g : {0,1}^n → ℝ]
Algorithm adaptively queries x_i and receives value f(x_i), for i = 1, …, q, where q = poly(n).
Algorithm produces “hypothesis” g. (Hopefully g ≈ f.)
Goal: g(x) ≤ f(x) ≤ α · g(x) ∀ x ∈ {0,1}^n, with α as small as possible.
Exact Learning with value queries
Goemans, Harvey, Iwata, Mirrokni, SODA 2009
Algorithm adaptively queries x_i and receives value f(x_i), for i = 1, …, q.
Algorithm produces “hypothesis” g. (Hopefully g ≈ f.)
Goal: g(x) ≤ f(x) ≤ α · g(x) ∀ x ∈ {0,1}^n, with α as small as possible.
Theorem (upper bound): ∃ an alg. for learning a submodular function with α = Õ(n^{1/2}).
Theorem (lower bound): Any alg. for learning a submodular function must have α = Ω̃(n^{1/2}).
Problems with this model
In learning theory, one usually only tries to predict the value of most points.
The GHIM lower bound fails if the goal is to do well on most of the points.
To define “most” we need a distribution on {0,1}^n.
Is there a distributional model for learning submodular functions?
Our Model
Distribution D on {0,1}^n
Algorithm sees examples (x_1, f(x_1)), …, (x_q, f(x_q)) where the x_i’s are i.i.d. from distribution D.
Algorithm produces “hypothesis” g. (Hopefully g ≈ f.)
[Diagram: examples x_i with values f(x_i) from f : {0,1}^n → ℝ_+ feed the algorithm, which outputs g : {0,1}^n → ℝ_+]
Our Model
Distribution D on {0,1}^n
Algorithm sees examples (x_1, f(x_1)), …, (x_q, f(x_q)) where the x_i’s are i.i.d. from distribution D.
Algorithm produces “hypothesis” g. (Hopefully g ≈ f.)
“Probably Mostly Approximately Correct”:
Pr_{x_1,…,x_q}[ Pr_x[ g(x) ≤ f(x) ≤ α · g(x) ] ≥ 1 − ε ] ≥ 1 − δ
[Diagram: for a new point x, with f : {0,1}^n → ℝ_+ and g : {0,1}^n → ℝ_+, is f(x) ≈ g(x)?]
Our Model
Distribution D on {0,1}^n
“Probably Mostly Approximately Correct”
– Impossible if f is arbitrary and # training points ≪ 2^n
– Possible if f is a non-negative, monotone, submodular function
[Diagram: for a new point x, with f : {0,1}^n → ℝ_+ and g : {0,1}^n → ℝ_+, is f(x) ≈ g(x)?]
Example: Concave Functions
Let h : ℝ → ℝ be concave.
[Figure: graph of h]
Example: Concave Functions
Let h : ℝ → ℝ be concave. For each S ⊆ V, let f(S) = h(|S|).
Claim: f is submodular.
We prove a partial converse.
[Figure: subsets of V, from ∅ to V]
Theorem: Every submodular function looks like this.
(Qualifiers: lots of, approximately, usually.)
[Figure: subsets of V, from ∅ to V]
Theorem: Every submodular function looks like this.
(Qualifiers: lots of, approximately, usually.)
Theorem: Let f be a non-negative, monotone, submodular, 1-Lipschitz function. There exists a concave function h : [0,n] → ℝ s.t., for any ε > 0, for every k ∈ {0, …, n}, and for a 1 − ε fraction of S ⊆ V with |S| = k, we have:
h(k) ≤ f(S) ≤ O(log_2(1/ε)) · h(k).
In fact, h(k) is just E[f(S)], where S is uniform on sets of size k.
Proof: Based on Talagrand’s Inequality.
[Figure: subsets of V, from ∅ to V; example shown: matroid rank function]
Learning Submodular Functions under any product distribution
Product distribution D on {0,1}^n
[Diagram: examples x_i with values f(x_i) from f : {0,1}^n → ℝ_+ feed the algorithm, which outputs g : {0,1}^n → ℝ_+]
Algorithm:
– Let μ = Σ_{i=1}^{q} f(x_i) / q
– Let g be the constant function with value μ
This achieves approximation factor O(log_2(1/ε)) on a 1 − ε fraction of points, with high probability.
Proof: Essentially follows from previous theorem.
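
The constant-hypothesis algorithm is easy to simulate. The sketch below is an illustration under stated assumptions, not the talk's experiment: the target `f(x) = min(|x|, 20)`, the uniform product distribution, and the sample sizes are all hypothetical choices, and the check uses a two-sided band μ/α ≤ f(x) ≤ α·μ around the empirical mean rather than the one-sided form g ≤ f ≤ α·g used in the PMAC definition.

```python
import random

def sample_product(p, rng):
    """Draw x in {0,1}^n with independent coordinates, Pr[x_i = 1] = p[i]."""
    return tuple(1 if rng.random() < pi else 0 for pi in p)

def learn_constant(f, p, q, rng):
    """Return mu, the empirical mean of f over q i.i.d. samples."""
    return sum(f(sample_product(p, rng)) for _ in range(q)) / q

def fraction_within(f, mu, alpha, p, trials, rng):
    """Estimate Pr_x[ mu/alpha <= f(x) <= alpha*mu ] on fresh samples."""
    hits = 0
    for _ in range(trials):
        value = f(sample_product(p, rng))
        if mu / alpha <= value <= alpha * mu:
            hits += 1
    return hits / trials

rng = random.Random(0)             # fixed seed so the run is reproducible
n = 30
p = [0.5] * n                      # assumed product distribution (uniform)
f = lambda x: min(sum(x), 20)      # assumed target: monotone, submodular, 1-Lipschitz
mu = learn_constant(f, p, q=500, rng=rng)
frac = fraction_within(f, mu, alpha=2.0, p=p, trials=2000, rng=rng)
```

Because |x| concentrates around n/2 under this distribution, almost all fresh points fall inside the band, which is the concentration phenomenon the previous theorem formalizes.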
Learning Submodular Functions under an arbitrary distribution?
Same argument no longer works: Talagrand’s inequality requires a product distribution.
Intuition: A non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.
[Figure: subsets of V, from ∅ to V]
A General Upper Bound?
Theorem (our upper bound): ∃ an algorithm for learning a submodular function w.r.t. an arbitrary distribution that has approximation factor O(n^{1/2}).
Computing Linear Separators
Given {+,–}-labeled points in ℝ^n, find a hyperplane c^T x = b that separates the +s and –s.
Easily solved by linear programming.
[Figure: scatter of + and – points]
Learning Linear Separators
Given a random sample of {+,–}-labeled points in ℝ^n, find a hyperplane c^T x = b that separates most of the +s and –s.
Classic machine learning problem.
[Figure: scatter of + and – points; misclassified points marked “Error!”]
Learning Linear Separators
Classic Theorem [Vapnik-Chervonenkis 1971?]: Õ(n/ε^2) samples suffice to get error ε.
[Figure: scatter of + and – points; misclassified points marked “Error!”]
Submodular Functions are Approximately Linear
Let f be non-negative, monotone and submodular.
Claim: f can be approximated to within factor n by a linear function g.
Proof Sketch: Let g(S) = Σ_{s ∈ S} f({s}). Then f(S) ≤ g(S) ≤ n · f(S).
– Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
– Monotonicity: f(S) ≤ f(T) ∀ S ⊆ T
– Non-negativity: f(S) ≥ 0 ∀ S ⊆ V
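
The sandwich f(S) ≤ g(S) ≤ n · f(S) from the proof sketch can be verified exhaustively on a toy instance. This is a minimal sketch assuming a small coverage function as the non-negative, monotone, submodular f; the `COVER` data and the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical coverage instance: element i of V covers the items in COVER[i].
COVER = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"a", "d", "e"}}
V = sorted(COVER)
n = len(V)

def f(S):
    """Coverage function: number of items covered by the elements of S."""
    covered = set()
    for i in S:
        covered |= COVER[i]
    return len(covered)

def g(S):
    """Linear function from the proof sketch: sum of singleton values."""
    return sum(f({i}) for i in S)

def sandwich_holds():
    """Check f(S) <= g(S) <= n * f(S) on every subset S of V."""
    for r in range(n + 1):
        for S in combinations(V, r):
            if not (f(S) <= g(S) <= n * f(S)):
                return False
    return True
```

The left inequality uses submodularity and non-negativity; the right one uses monotonicity, since each singleton value f({s}) is at most f(S) and there are at most n terms.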
Submodular Functions are Approximately Linear
[Figure: f, n·f, and the linear function g over the subsets of V]
[Figure: f, n·f, and the linear function g over the subsets of V, with sampled points]
Randomly sample {S_1, …, S_q} from the distribution.
Create a + for f(S_i) and a – for n · f(S_i).
Now just learn a linear separator!
[Figure: f, n·f, and the linear function g over the subsets of V]
Theorem: g approximates f to within a factor n on a 1 − ε fraction of the distribution.
Can improve to factor O(n^{1/2}) by the GHIM lemma: ellipsoidal approximation of submodular functions.
A Lower Bound?
A non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.
Can we create a submodular function with lots of deep “bumps”? Yes!
[Figure: subsets of V, from ∅ to V]
A General Lower Bound
Theorem (our general lower bound): No algorithm can PMAC-learn the class of non-negative, monotone, submodular functions with an approximation factor õ(n^{1/3}).
Plan: Use the fact that matroid rank functions are submodular. Construct a hard family of matroids.
Pick A_1, …, A_m ⊂ V with |A_i| = n^{1/3} and m = n^{log n}.
[Figure: sets A_1, A_2, A_3, …, A_L; rank levels Low = log^2 n and High = n^{1/3}]
Matroids
Ground set V, family of independent sets ℐ
Axioms:
– ∅ ∈ ℐ (“nonempty”)
– J ⊂ I ∈ ℐ ⇒ J ∈ ℐ (“downwards closed”)
– J, I ∈ ℐ and |J| < |I| ⇒ ∃ x ∈ I \ J s.t. J + x ∈ ℐ (“maximum-size sets can be found greedily”)
Rank function: r(S) = max { |I| : I ∈ ℐ and I ⊆ S }
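
The axioms and the rank function translate directly into a brute-force checker for small ground sets. The sketch below is illustrative only: `is_matroid`, `rank`, and the two example independence oracles are assumed names, and enumerating all subsets is exponential in |V|.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_matroid(ground, independent):
    """Brute-force check of the three axioms for an independence oracle."""
    family = [S for S in subsets(ground) if independent(S)]
    fam = set(family)
    if frozenset() not in fam:
        return False
    # Downward closure: removing any single element suffices (by induction).
    for I in family:
        if any(I - {x} not in fam for x in I):
            return False
    # Exchange: a smaller independent set can be extended from a larger one.
    for I in family:
        for J in family:
            if len(J) < len(I) and not any(J | {x} in fam for x in I - J):
                return False
    return True

def rank(S, independent):
    """r(S) = max |I| over independent sets I contained in S."""
    return max(len(I) for I in subsets(S) if independent(I))

V = frozenset(range(5))
uniform = lambda S: len(S) <= 2                  # uniform matroid: sets of size <= 2
not_matroid = lambda S: S <= {0, 1} or S <= {2}  # downward closed, but exchange fails
```

In the second oracle, J = {2} and I = {0, 1} are both independent, yet neither {0, 2} nor {1, 2} is, so the exchange axiom is violated.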
f(S) = min{ |S|, k }
r(S) = |S| if |S| ≤ k; k otherwise
[Figure: subsets of V, from ∅ to V]
r(S) = k − 1 if S = A; otherwise r(S) = min{ |S|, k }
[Figure: subsets of V, from ∅ to V, with a single bump at the set A]
r(S) = k − 1 if S ∈ 𝒜; otherwise r(S) = min{ |S|, k }
𝒜 = {A_1, …, A_m}, |A_i| = k ∀ i
Claim: r is submodular if |A_i ∩ A_j| ≤ k − 2 ∀ i ≠ j
r is the rank function of a “paving matroid”
[Figure: subsets of V, from ∅ to V, with bumps at A_1, A_2, A_3, …, A_m]
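
The role of the intersection condition can be seen on a six-element ground set. This is a small sketch under assumed toy data: `make_rank` builds the bump function described on this slide, and the families `good` and `bad` are hypothetical examples that respectively satisfy and violate |A_i ∩ A_j| ≤ k − 2.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def make_rank(k, bumps):
    """Return r with r(S) = k - 1 if S is a bump, else min(|S|, k)."""
    bump_sets = {frozenset(A) for A in bumps}
    def r(S):
        S = frozenset(S)
        return k - 1 if S in bump_sets else min(len(S), k)
    return r

def is_submodular(f, ground):
    """Check f(S) + f(T) >= f(S & T) + f(S | T) for all S, T."""
    all_sets = list(subsets(ground))
    return all(f(S) + f(T) >= f(S & T) + f(S | T)
               for S in all_sets for T in all_sets)

V = range(6)
k = 3
good = [{0, 1, 2}, {2, 3, 4}]  # bumps share 1 = k - 2 elements
bad = [{0, 1, 2}, {1, 2, 3}]   # bumps share 2 = k - 1 elements
```

For `bad`, taking S = {0,1,2} and T = {1,2,3} gives r(S) + r(T) = 4 but r(S ∩ T) + r(S ∪ T) = 2 + 3 = 5, so submodularity fails.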
r(S) = k − 1 if S ∈ 𝒜; otherwise r(S) = min{ |S|, k }
𝒜 = {A_1, …, A_m}, |A_i| = k ∀ i, |A_i ∩ A_j| ≤ k − 2 ∀ i ≠ j
[Figure: subsets of V, from ∅ to V, with bumps at A_1, A_2, A_3, …, A_m]
r(S) = k − 1 if S ∈ 𝒜 and S wasn’t deleted; otherwise r(S) = min{ |S|, k }
Delete half of the bumps at random.
If m is large, the algorithm cannot learn which were deleted ⇒ any algorithm to learn f has additive error 1.
[Figure: bumps A_1, A_2, A_3, …, A_m between ∅ and V; if the algorithm sees only these examples, then f can’t be predicted here]
Can we force a bigger error with bigger bumps? Yes!
– Need to generalize paving matroids
– 𝒜 needs to have very strong properties
[Figure: bumps A_1, A_2, A_3, …, A_m between ∅ and V]
The Main Question
Let V = A_1 ∪ … ∪ A_m and b_1, …, b_m ∈ ℕ.
Is there a matroid s.t.
– r(A_i) ≤ b_i ∀ i
– r(S) is “as large as possible” for sets S other than the A_i (this is not formal)
If the A_i’s are disjoint, the solution is a partition matroid.
If the A_i’s are “almost disjoint”, can we find a matroid that’s “almost” a partition matroid?
Next: formalize this
Lossless Expander Graphs
Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
– Every u ∈ U has degree D
– |Γ(S)| ≥ (1 − ε) · D · |S| ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u,v} ∈ E }
“Every small left-set has nearly-maximal number of right-neighbors”
[Figure: bipartite graph with left side U and right side V]
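
The definition can be checked directly on a tiny bipartite graph. The following is a minimal brute-force sketch, assuming the graph is given as a dictionary `adj` from each left-vertex to its set of right-neighbors; the function names and the two toy graphs are hypothetical, and the check enumerates every left-set of size at most K.

```python
from itertools import combinations

def neighborhood(adj, S):
    """Gamma(S): union of the right-neighbors of the left-vertices in S."""
    out = set()
    for u in S:
        out |= adj[u]
    return out

def is_lossless_expander(adj, D, K, eps):
    """Brute-force check of the (D, K, eps)-lossless expander definition."""
    if any(len(nbrs) != D for nbrs in adj.values()):
        return False
    left = sorted(adj)
    for size in range(1, min(K, len(left)) + 1):
        for S in combinations(left, size):
            if len(neighborhood(adj, S)) < (1 - eps) * D * size:
                return False
    return True

# Hypothetical toy graphs on left-vertices u1, u2, u3.
disjoint = {"u1": {1, 2, 3}, "u2": {4, 5, 6}, "u3": {7, 8, 9}}
overlapping = {"u1": {1, 2, 3}, "u2": {3, 4, 5}, "u3": {5, 6, 1}}
```

With disjoint neighborhoods the expansion is perfect (ε = 0 for every K), while the overlapping graph only passes once ε is large enough to absorb the shared right-vertices.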
Lossless Expander Graphs
Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
– Every u ∈ U has degree D
– |Γ(S)| ≥ (1 − ε) · D · |S| ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u,v} ∈ E }
“Neighborhoods of left-vertices are K-wise-almost-disjoint”
[Figure: bipartite graph with left side U and right side V]
Trivial Case: Disjoint Neighborhoods
Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
– Every u ∈ U has degree D
– |Γ(S)| ≥ (1 − ε) · D · |S| ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u,v} ∈ E }
If left-vertices have disjoint neighborhoods, this gives an expander with ε = 0, K = ∞.
[Figure: bipartite graph with left side U and right side V]
Main Theorem: Trivial Case
Suppose G = (U ∪ V, E) has disjoint left-neighborhoods.
Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }.
Let b_1, …, b_m be non-negative integers.
Theorem: ℐ = { I : |I ∩ A_j| ≤ b_j ∀ j } is the family of independent sets of a matroid (a partition matroid).
[Figure: left-vertices u_1, u_2, u_3 in U with disjoint neighborhoods A_1, A_2 in V, capped at ≤ b_1 and ≤ b_2]
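
In the disjoint case the matroid is a standard partition matroid, whose rank has a simple closed form. The sketch below compares that closed form with a brute-force maximum over independent subsets; the parts, bounds, and function names are hypothetical toy choices, and elements outside every part are treated as unconstrained.

```python
from itertools import combinations

def independent(I, parts, bounds):
    """I is independent iff |I ∩ A_j| <= b_j for every part A_j."""
    I = set(I)
    return all(len(I & A) <= b for A, b in zip(parts, bounds))

def rank_formula(S, parts, bounds):
    """Closed form for disjoint parts: capped count per part, plus free elements."""
    S = set(S)
    covered = set().union(*parts)
    capped = sum(min(len(S & A), b) for A, b in zip(parts, bounds))
    return capped + len(S - covered)

def rank_bruteforce(S, parts, bounds):
    """Size of the largest independent subset of S, found by enumeration."""
    items = sorted(S)
    for r in range(len(items), -1, -1):
        if any(independent(c, parts, bounds) for c in combinations(items, r)):
            return r
    return 0

# Hypothetical instance: two disjoint parts inside V = {1, ..., 6}; element 6 is free.
parts = [{1, 2, 3}, {4, 5}]
bounds = [2, 1]
```

For example, the whole ground set has rank 2 + 1 + 1 = 4: two elements from the first part, one from the second, and the free element 6.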
Main Theorem
Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander.
Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }.
Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i.
[Figure: neighborhoods A_1 and A_2, capped at ≤ b_1 and ≤ b_2]
Main Theorem
Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander.
Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }.
Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i.
“Desired Theorem”: ℐ is a matroid, where ℐ = …
Main Theorem
Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander.
Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }.
Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i.
Theorem: ℐ is a matroid, where ℐ = …
Main Theorem
Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander.
Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }.
Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i.
Theorem: ℐ is a matroid, where ℐ = …
Trivial case: G has disjoint neighborhoods, i.e., K = ∞ and ε = 0.
LB for Learning Submodular Functions
How deep can we make the “valleys”?
[Figure: subsets of V, from ∅ to V, with valleys at A_1 and A_2; rank levels n^{1/3} and log^2 n]
LB for Learning Submodular Functions
Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander, where A_i = Γ(u_i) and
– |V| = n
– |U| = n^{log n}
– D = K = n^{1/3}
– ε = log^2(n) / n^{1/3}
Such graphs exist by the probabilistic method.
Lower Bound Proof:
– Delete each node in U with prob. ½, then use the main theorem to get a matroid.
– If u_i ∈ U was not deleted then r(A_i) ≤ b_i = 4εD = O(log^2 n).
– Claim: If u_i was deleted then A_i ∈ ℐ (needs a proof) ⇒ r(A_i) = |A_i| = D = n^{1/3}.
– Since # A_i’s = |U| = n^{log n}, no algorithm can learn a significant fraction of the r(A_i) values in polynomial time.
Summary
PMAC model for learning real-valued functions
Learning under arbitrary distributions:
– Factor O(n^{1/2}) algorithm
– Factor Ω̃(n^{1/3}) hardness (info-theoretic)
Learning under product distributions:
– Factor O(log(1/ε)) algorithm
New general family of matroids
– Generalizes partition matroids to non-disjoint parts
Open Questions
Improve the Ω̃(n^{1/3}) lower bound to Ω̃(n^{1/2})
Explicit construction of expanders
Non-monotone submodular functions
– Any algorithm?
– Lower bound better than Ω̃(n^{1/3})
For the algorithm under the uniform distribution, relax the 1-Lipschitz condition