Learning Submodular Functions
Maria Florina Balcan
LGO, 11/16/2010

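Submodular functions
- $V = \{1, 2, \dots, n\}$; set function $f : 2^V \to \mathbb{R}$.
- Submodularity: $f(S) + f(T) \ge f(S \cap T) + f(S \cup T)$, $\forall S, T \subseteq V$.
- Decreasing marginal return: $f(T \cup \{x\}) - f(T) \ge f(S \cup \{x\}) - f(S)$, $\forall S, T \subseteq V$ with $T \subseteq S$, $x \notin S$.
Examples:
- Vector spaces: let $V = \{v_1, \dots, v_n\}$, each $v_i \in \mathbb{F}^n$. For each $S \subseteq V$, let $f(S) = \mathrm{rank}(V[S])$.
- Concave functions: let $h : \mathbb{R} \to \mathbb{R}$ be concave. For each $S \subseteq V$, let $f(S) = h(|S|)$.
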
Submodular set functions
- A set function $f$ on $V$ is called submodular if, for all $S, T \subseteq V$: $f(S) + f(T) \ge f(S \cup T) + f(S \cap T)$.
- Equivalent diminishing returns characterization: for $T \subseteq S$, $x \notin S$: $f(T \cup \{x\}) - f(T) \ge f(S \cup \{x\}) - f(S)$.
- Adding $x$ to the smaller set $T$ gives a large improvement; adding $x$ to the larger set $S$ gives a small improvement.

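The two characterizations above can be checked by brute force on a small ground set. The following is a minimal sketch, not part of the talk: the helper names (`subsets`, `is_submodular`, `has_diminishing_returns`) and the two test functions are illustrative assumptions, and the enumeration is exponential in $n$.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground, tol=1e-9):
    """Check f(S) + f(T) >= f(S | T) + f(S & T) for all S, T."""
    all_sets = list(subsets(ground))
    return all(f(S) + f(T) + tol >= f(S | T) + f(S & T)
               for S in all_sets for T in all_sets)

def has_diminishing_returns(f, ground, tol=1e-9):
    """Check f(T+x) - f(T) >= f(S+x) - f(S) for all T <= S, x not in S."""
    ground = frozenset(ground)
    for S in subsets(ground):
        for T in subsets(S):
            for x in ground - S:
                if f(T | {x}) - f(T) + tol < f(S | {x}) - f(S):
                    return False
    return True

V = range(4)
concave_of_size = lambda S: len(S) ** 0.5   # h(|S|) with h concave
convex_of_size = lambda S: len(S) ** 2      # h(|S|) with h convex

print(is_submodular(concave_of_size, V), has_diminishing_returns(concave_of_size, V))
print(is_submodular(convex_of_size, V), has_diminishing_returns(convex_of_size, V))
```

On these two examples both checks agree: the concave function of $|S|$ passes and the convex one fails.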
Example: Set cover
- Place sensors in a building: want to cover the floorplan with discs.
- Possible locations $V$; each node predicts values of positions within some radius.
- For $S \subseteq V$: $f(S)$ = "area (# locations) covered by sensors placed at $S$".
- Formally: $W$ a finite set, with a collection of $n$ subsets $W_i \subseteq W$. For $S \subseteq V = \{1, \dots, n\}$ define $f(S) = |\bigcup_{i \in S} W_i|$.

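Set cover is submodular
- $T = \{W_1, W_2\}$, $S = \{W_1, W_2, W_3, W_4\}$.
- A new set $x$ adds at least as much coverage to $T$ as to $S$: $f(T \cup \{x\}) - f(T) \ge f(S \cup \{x\}) - f(S)$.
- [Figure: disc $x$ overlapping $W_1, W_2$ (left) and $W_1, \dots, W_4$ (right).]
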
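A small numeric instance makes the picture concrete. This is a sketch with made-up regions (the element names `a` to `f` and the dictionary `regions` are assumptions, not data from the talk); it only illustrates the inequality on one pair $T \subseteq S$.

```python
def coverage(regions):
    """Return the set function f(S) = |union of regions[i] for i in S|."""
    def f(S):
        covered = set()
        for i in S:
            covered |= regions[i]
        return len(covered)
    return f

# Hypothetical regions covered by each candidate sensor location.
regions = {
    "W1": {"a", "b"},
    "W2": {"b", "c"},
    "W3": {"c", "d"},
    "W4": {"d", "e"},
    "x": {"c", "d", "f"},
}
f = coverage(regions)

T = {"W1", "W2"}
S = {"W1", "W2", "W3", "W4"}
gain_T = f(T | {"x"}) - f(T)   # x adds d and f to T's coverage
gain_S = f(S | {"x"}) - f(S)   # x adds only f to S's coverage
print(gain_T, gain_S)
```

Here the gain on the smaller collection $T$ is 2 and the gain on the larger collection $S$ is 1.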
Submodular functions
- $V = \{1, 2, \dots, n\}$; set function $f : 2^V \to \mathbb{R}$.
- Submodularity: $f(S) + f(T) \ge f(S \cap T) + f(S \cup T)$, $\forall S, T \subseteq V$.
- Decreasing marginal return, stated with the roles of $S$ and $T$ swapped: $f(T \cup \{x\}) - f(T) \le f(S \cup \{x\}) - f(S)$, $\forall S, T \subseteq V$ with $S \subseteq T$, $x \notin T$.
- Examples: rank functions of vector sets, $f(S) = \mathrm{rank}(V[S])$; concave functions of cardinality, $f(S) = h(|S|)$.

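Submodular functions
- $V = \{1, 2, \dots, n\}$; set function $f : 2^V \to \mathbb{R}$.
- Submodularity: $f(S) + f(T) \ge f(S \cap T) + f(S \cup T)$, $\forall S, T \subseteq V$.
- Monotone: $f(S) \le f(T)$, $\forall S \subseteq T$.
- Non-negative: $f(S) \ge 0$, $\forall S \subseteq V$.
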
Submodular functions
- A lot of work on optimization and submodularity; submodular functions can be minimized in polynomial time.
- Algorithmic game theory: decreasing marginal utilities.
- Substantial recent interest in the ML community: tutorials and workshops at ICML, NIPS, etc.
- www.submodularity.org/ owned by ML.

Learnability of Submodular Fns
- Important to also understand their learnability.
- Previous work: exact learning with value queries. Goemans, Harvey, Iwata, Mirrokni, SODA 2009 [GHIM'09].
Model:
- There is an unknown submodular target function.
- The algorithm is allowed to (adaptively) pick sets and query the value of the target on those sets.
- Can we learn the target with a polynomial number of queries in poly time?
- Output a function that approximates the target within a factor of $\alpha$ on every single subset.

Exact learning with value queries
Goemans, Harvey, Iwata, Mirrokni, SODA 2009
- Theorem (general upper bound): $\exists$ an alg. for learning a submodular function with an approx. factor $O(n^{1/2})$.
- Theorem (general lower bound): any alg. for learning a submodular function must have an approx. factor of $\Omega(n^{1/2})$.

Problems with the GHIM model
- Lower bound fails if our goal is to do well on most of the points.
- Many simple functions that are easy to learn in the PAC model (e.g., conjunctions) are impossible to get exactly from a poly number of queries.
- Well known that value queries are undesirable in some learning applications.
Is there a better model that gets around these problems?

Problems with the GHIM model
- A better model: learning submodular fns in a distributional learning setting [BH10].

Our model: Passive Supervised Learning
- Data source: distribution $D$ on $\{0,1\}^n$.
- Expert / oracle: target $f : \{0,1\}^n \to \mathbb{R}_+$.
- Learning algorithm: receives labeled examples $(x_1, f(x_1)), \dots, (x_m, f(x_m))$ and outputs $g : \{0,1\}^n \to \mathbb{R}_+$.

Our model: Passive Supervised Learning
- The algorithm sees $(x_1, f(x_1)), \dots, (x_m, f(x_m))$, with $x_i$ i.i.d. from $D$.
- The algorithm produces a "hypothesis" $g$ (hopefully $g \approx f$).
- Guarantee: $\Pr_{x_1, \dots, x_m}\big[\, \Pr_x[\, g(x) \le f(x) \le \alpha\, g(x) \,] \ge 1 - \epsilon \,\big] \ge 1 - \delta$.
- "Probably Mostly Approximately Correct" (PMAC).

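The inner probability in the PMAC guarantee can be estimated empirically on a finite set of points. The sketch below is only an illustration under stated assumptions: `pmac_success_rate`, the toy target `f`, and the constant hypothesis `g` are hypothetical, and the points are all of $\{0,1\}^3$ weighted uniformly rather than a sample from an arbitrary $D$.

```python
from itertools import product

def pmac_success_rate(f, g, points, alpha):
    """Fraction of points x with g(x) <= f(x) <= alpha * g(x)."""
    good = sum(1 for x in points if g(x) <= f(x) <= alpha * g(x))
    return good / len(points)

# Toy target and hypothesis on {0,1}^3 (illustrative assumptions).
f = lambda x: 1 + sum(x)      # takes values 1..4
g = lambda x: 1               # constant hypothesis
points = list(product([0, 1], repeat=3))

rate = pmac_success_rate(f, g, points, alpha=3)
print(rate)  # 7 of the 8 points satisfy 1 <= f(x) <= 3
```

With $\alpha = 3$ the hypothesis is within the factor on a $7/8$ fraction of the points, so it would count as a success for any $\epsilon \ge 1/8$ under the uniform distribution.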
Main results
- Theorem (our general upper bound): $\exists$ an alg. for PMAC-learning the class of non-negative, monotone, submodular fns (w.r.t. an arbitrary distribution) with an approx. factor $O(n^{1/2})$. Note: much simpler alg. compared to GHIM'09.
- Theorem (our general lower bound): no algorithm can PMAC-learn the class of non-neg., monotone, submodular fns with an approx. factor $\tilde{o}(n^{1/3})$. Note: the GHIM'09 lower bound fails in our model.
- Theorem (product distributions): matroid rank functions, const. approx.

A General Upper Bound
- Theorem: $\exists$ an alg. for PMAC-learning the class of non-negative, monotone, submodular fns (w.r.t. an arbitrary distribution) with an approx. factor $O(n^{1/2})$.

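Subadditive Fns are Approximately Linear
- Let $f$ be non-negative, monotone and subadditive.
- Subadditive: $f(S) + f(T) \ge f(S \cup T)$, $\forall S, T \subseteq V$.
- Monotone: $f(S) \le f(T)$, $\forall S \subseteq T$.
- Non-negative: $f(S) \ge 0$, $\forall S \subseteq V$.
- Claim: $f$ can be approximated to within a factor $n$ by a linear function $g$.
- Proof sketch: let $g(S) = \sum_{s \in S} f(\{s\})$. Then $f(S) \le g(S) \le n \cdot f(S)$: the first inequality is subadditivity, and the second holds because monotonicity gives $f(\{s\}) \le f(S)$ for each of the at most $n$ elements $s \in S$.
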
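The claim can be sanity-checked exhaustively on a small example. This is a minimal sketch: the function name `singleton_sum` and the choice of target $f(S) = \min(|S|, 2)$ (a uniform matroid rank function, which is non-negative, monotone and subadditive) are illustrative assumptions.

```python
from itertools import combinations

def singleton_sum(f, ground):
    """Return the linear function g(S) = sum of f({s}) over s in S."""
    weights = {s: f(frozenset([s])) for s in ground}
    return lambda S: sum(weights[s] for s in S)

n = 4
ground = range(n)
f = lambda S: min(len(S), 2)          # illustrative subadditive target
g = singleton_sum(f, ground)

all_sets = [frozenset(c) for r in range(n + 1)
            for c in combinations(ground, r)]
sandwiched = all(f(S) <= g(S) <= n * f(S) for S in all_sets)
print(sandwiched)
```

For this target $g(S) = |S|$, so the check reduces to $\min(|S|, 2) \le |S| \le 4 \min(|S|, 2)$.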
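Subadditive Fns are Approximately Linear
- $f(S) \le g(S) \le n \cdot f(S)$.
- [Figure: $f$, $g$ and $n \cdot f$ plotted over the subsets of $V$, with $g$ lying between $f$ and $n \cdot f$.]
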
PMAC Learning Subadditive Fns
- $f$ non-negative, monotone, subadditive is approximated to within a factor $n$ by a linear function $g$, $g(S) = w \cdot \chi(S)$, where $\chi(S)$ is the indicator vector of $S$.
- Labeled examples $((\chi(S), f(S)), +)$ and $((\chi(S), n \cdot f(S)), -)$ are linearly separable in $\mathbb{R}^{n+1}$.
- Idea: learn a linear separator; use standard sample complexity bounds.
- Problem: data not i.i.d. Solution: create a related distribution. Sample $S$ from $D$ and flip a coin. If heads, add $((\chi(S), f(S)), +)$; else add $((\chi(S), n \cdot f(S)), -)$.
- Separator conditions: $w \cdot \chi(S) - f(S) > 0$ and $w \cdot \chi(S) - (n+1) f(S) < 0$; with a scale factor $z$: $w \cdot \chi(S) - z f(S) > 0$ and $w \cdot \chi(S) - z (n+1) f(S) < 0$.

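The separability claim can be verified directly on a toy target. The sketch below makes several assumptions that are not in the slides: it uses the factor $(n+1)$ for negative points (matching the separator inequalities rather than the $n \cdot f(S)$ in the example list), takes $w_s = f(\{s\})$, picks the scale $z = (2n+1)/(2n+2)$ so that both inequalities are strict, and skips the empty set, where $f = 0$.

```python
from itertools import combinations

def chi(S, n):
    """Indicator vector of S as a list of floats."""
    return [1.0 if i in S else 0.0 for i in range(n)]

def labeled_points(f, n):
    """Both labeled copies of every nonempty S: (chi, f) is +, (chi, (n+1) f) is -."""
    points = []
    for r in range(1, n + 1):
        for combo in combinations(range(n), r):
            S = frozenset(combo)
            points.append((chi(S, n) + [f(S)], +1))
            points.append((chi(S, n) + [(n + 1) * f(S)], -1))
    return points

def separates(u, points):
    """True if sign(u . p) matches the label of every point, strictly."""
    return all(label * sum(a * b for a, b in zip(u, p)) > 0
               for p, label in points)

n = 4
f = lambda S: min(len(S), 2)                  # illustrative subadditive target
w = [float(f(frozenset([i]))) for i in range(n)]
z = (2 * n + 1) / (2 * n + 2)                 # any z in (n/(n+1), 1) works
u = w + [-z]
points = labeled_points(f, n)
print(separates(u, points))
```

The choice of $z$ strictly between $n/(n+1)$ and $1$ is what turns $f(S) \le g(S) \le n \cdot f(S)$ into two strict inequalities whenever $f(S) > 0$.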
PMAC Learning Subadditive Fns
Algorithm (note: deal with the set $\{S : f(S) = 0\}$ separately):
- Input: $(S_1, f(S_1)), \dots, (S_m, f(S_m))$.
- For each $S_i$, flip a coin. If heads, add $((\chi(S_i), f(S_i)), +)$; else add $((\chi(S_i), n \cdot f(S_i)), -)$.
- Learn a linear separator $u = (w, -z)$ in $\mathbb{R}^{n+1}$.
- Output: $g(S) = \frac{1}{n+1}\, w \cdot \chi(S)$.
Theorem: for $m = \Theta(n/\epsilon)$, $g$ approximates $f$ to within a factor $n$ on a $1 - \epsilon$ fraction of the distribution.

PMAC Learning Submodular Fns
Algorithm (note: deal with the set $\{S : f(S) = 0\}$ separately):
- Input: $(S_1, f(S_1)), \dots, (S_m, f(S_m))$.
- For each $S_i$, flip a coin. If heads, add $((\chi(S_i), f^2(S_i)), +)$; else add $((\chi(S_i), n \cdot f^2(S_i)), -)$.
- Learn a linear separator $u = (w, -z)$ in $\mathbb{R}^{n+1}$.
- Output: $g(S) = \left(\frac{1}{n+1}\, w \cdot \chi(S)\right)^{1/2}$.
Theorem: for $m = \Theta(n/\epsilon)$, $g$ approximates $f$ to within a factor $\sqrt{n}$ on a $1 - \epsilon$ fraction of the distribution.
Proof idea: a non-negative, monotone, submodular $f$ is approximated to within a factor $\sqrt{n}$ by the square root of a linear function [GHIM'09].

PMAC Learning Submodular Fns
Remarks on the same algorithm:
- Much simpler than [GHIM'09].
- More robust to variations: the target only needs to be within a $\beta$ factor of a submodular fnc.
- It also suffices that $\exists$ a submodular fnc that agrees with the target on all but an $\eta$ fraction of the points (on the points where it disagrees it can be arbitrarily far). [The alg. is inefficient in this case.]

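The algorithm for the submodular case can be sketched end to end on a toy instance. Everything concrete here is an assumption for illustration: the perceptron is swapped in as the separator learner (the slides only say "learn a linear separator"), negative points use the factor $(n+1)$ as in the separator inequalities, the output is normalized by the learned $z$, the target is $f(S) = \sqrt{|S|}$, and samples are drawn uniformly from the nonempty subsets of a 4-element ground set.

```python
import math
import random
from itertools import combinations

def chi(S, n):
    return [1.0 if i in S else 0.0 for i in range(n)]

def build_examples(samples, n, rng):
    """Coin flip per sample: (chi, f^2) labeled +1 or (chi, (n+1) f^2) labeled -1."""
    examples = []
    for S, value in samples:
        squared = value ** 2
        if rng.random() < 0.5:
            examples.append((chi(S, n) + [squared], +1))
        else:
            examples.append((chi(S, n) + [(n + 1) * squared], -1))
    return examples

def perceptron(examples, max_epochs=100_000):
    """Homogeneous perceptron; returns u with label * (u . p) > 0 on all examples."""
    u = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for p, label in examples:
            if label * sum(a * b for a, b in zip(u, p)) <= 0:
                u = [a + label * b for a, b in zip(u, p)]
                mistakes += 1
        if mistakes == 0:
            return u
    raise RuntimeError("no separator found within max_epochs")

def hypothesis(u, n):
    """g(S) = sqrt(w . chi(S) / ((n+1) z)); meaningful only when z > 0."""
    w, z = u[:-1], -u[-1]
    return lambda S: math.sqrt(max(0.0, sum(w[i] for i in S) / ((n + 1) * z)))

n = 4
rng = random.Random(0)
nonempty = [frozenset(c) for r in range(1, n + 1)
            for c in combinations(range(n), r)]
samples = [(S, math.sqrt(len(S))) for S in (rng.choice(nonempty) for _ in range(60))]
examples = build_examples(samples, n, rng)
u = perceptron(examples)
g = hypothesis(u, n)   # evaluate g(S) only after checking that -u[-1] > 0
```

The perceptron terminates here because $f^2(S) = |S|$ is exactly linear, so the examples are separable with a positive margin. On any set that appears with both labels, a consistent separator forces $z > 0$ and $g(S) \le f(S) \le \sqrt{n+1}\, g(S)$.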
A General Lower Bound
- Theorem (our general lower bound): no algorithm can PMAC-learn the class of non-neg., monotone, submodular fns with an approx. factor $\tilde{o}(n^{1/3})$.
- Plan: use the fact that any matroid rank fnc is submodular; construct a hard family of matroid rank functions.
- [Figure: sets $A_1, A_2, A_3, \dots, A_L$ with $L = n^{\log\log n}$; each set has rank either High $= n^{1/3}$ or Low $= \log^2 n$.]

Matroids
$(V, \mathrm{Ind})$ is a matroid if:
- Subsets of independent sets are independent.
- If $I, J$ are independent and $|I| < |J|$, then $I \cup \{j\}$ is independent for some $j \in J \setminus I$.
$\mathrm{Rank}(S) = \max\{|I| : I \in \mathrm{Ind},\ I \subseteq S\}$, for any $S \subseteq V$.
Examples:
- Uniform matroids: $V = \{1, 2, \dots, n\}$, $\mathrm{Ind} = \{I \subseteq V : |I| \le k\}$.
- Graphical matroid: the elements are the edges of an undirected graph $G = (V, E)$; a set of edges is independent if it does not contain a cycle.

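Rank functions for the two example matroids are easy to compute. The sketch below uses hypothetical helper names (`uniform_rank`, `graphic_rank`); the graphic rank is computed as the size of a spanning forest with a union-find structure, which is a standard approach rather than anything specified in the talk.

```python
def uniform_rank(S, k):
    """Rank in the uniform matroid: the largest independent subset has min(|S|, k) elements."""
    return min(len(S), k)

def graphic_rank(edges):
    """Rank in the graphical matroid: number of edges in a spanning forest."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    rank = 0
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:                # edge joins two components: no cycle
            parent[root_a] = root_b
            rank += 1
    return rank

triangle = [(1, 2), (2, 3), (1, 3)]
print(uniform_rank({1, 2, 3}, 2))           # 2
print(graphic_rank(triangle))               # 2: the third edge closes a cycle
print(graphic_rank(triangle + [(3, 4)]))    # 3
```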
Partition Matroids
- $A_1, A_2, \dots, A_k \subseteq V = \{1, 2, \dots, n\}$, all disjoint; $u_i \le |A_i| - 1$.
- $\mathrm{Ind} = \{I : |I \cap A_j| \le u_j \text{ for all } j\}$. Then $(V, \mathrm{Ind})$ is a matroid.
- If the sets $A_i$ are not disjoint, then $(V, \mathrm{Ind})$ might not be a matroid. E.g., $n = 5$, $A_1 = \{1,2,3\}$, $A_2 = \{3,4,5\}$, $u_1 = u_2 = 2$: $\{1,2,4,5\}$ and $\{2,3,4\}$ are both maximal sets in $\mathrm{Ind}$ but do not have the same cardinality.

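The counterexample can be confirmed by enumerating maximal independent sets. This is a brute-force sketch with hypothetical helper names; the disjoint instance used for comparison ($A_1 = \{1,2,3\}$, $A_2 = \{4,5\}$, $u = (2, 1)$) is an added assumption, while the overlapping instance is the one from the slide.

```python
from itertools import combinations

def independent(I, blocks, caps):
    """Partition-style constraints: |I & A_j| <= u_j for every j."""
    return all(len(I & A) <= u for A, u in zip(blocks, caps))

def maximal_independent_sets(ground, blocks, caps):
    ground = frozenset(ground)
    ind = {frozenset(c) for r in range(len(ground) + 1)
           for c in combinations(sorted(ground), r)
           if independent(frozenset(c), blocks, caps)}
    return [I for I in ind if not any(I | {x} in ind for x in ground - I)]

ground = {1, 2, 3, 4, 5}
disjoint = maximal_independent_sets(ground, [{1, 2, 3}, {4, 5}], [2, 1])
overlapping = maximal_independent_sets(ground, [{1, 2, 3}, {3, 4, 5}], [2, 2])

print({len(I) for I in disjoint})      # a single size, as in any matroid
print({len(I) for I in overlapping})   # two different sizes: not a matroid
```

In a matroid all maximal independent sets have the same size, so the two sizes in the overlapping case rule it out.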
Almost partition matroids
- $k = 2$; $A_1, A_2 \subseteq V$ (not necessarily disjoint); $u_i \le |A_i| - 1$.
- $\mathrm{Ind} = \{I : |I \cap A_j| \le u_j,\ |I \cap (A_1 \cup A_2)| \le u_1 + u_2 - |A_1 \cap A_2|\}$.
- Then $(V, \mathrm{Ind})$ is a matroid.

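Almost partition matroids
More generally:
- $A_1, A_2, \dots, A_k \subseteq V = \{1, 2, \dots, n\}$, $u_i \le |A_i| - 1$; $f : 2^{[k]} \to \mathbb{Z}$.
- $f(J) = \sum_{j \in J} u_j + |A(J)| - \sum_{j \in J} |A_j|$, $\forall J \subseteq [k]$, where $A(J) = \bigcup_{j \in J} A_j$.
- $\mathrm{Ind} = \{I : |I \cap A(J)| \le f(J),\ \forall J \subseteq [k]\}$. Then $(V, \mathrm{Ind})$ is a matroid (if nonempty).
- Rewriting $f$: $f(J) = |A(J)| - \sum_{j \in J} (|A_j| - u_j)$, $\forall J \subseteq [k]$.
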
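On the earlier overlapping example, the extra constraints repair the matroid property. The following brute-force sketch assumes $A(J)$ is the union of the $A_j$ with $j \in J$ and checks the matroid axioms directly; the helper names are hypothetical, and only nonempty $J$ are enumerated since $J = \emptyset$ gives the trivial constraint $0 \le 0$.

```python
from itertools import combinations

def almost_partition_independent(I, blocks, caps):
    """|I & A(J)| <= |A(J)| - sum_{j in J} (|A_j| - u_j) for every nonempty J."""
    k = len(blocks)
    for r in range(1, k + 1):
        for J in combinations(range(k), r):
            union = frozenset().union(*(blocks[j] for j in J))
            bound = len(union) - sum(len(blocks[j]) - caps[j] for j in J)
            if len(I & union) > bound:
                return False
    return True

def plain_independent(I, blocks, caps):
    return all(len(I & A) <= u for A, u in zip(blocks, caps))

def is_matroid(ind):
    """Check nonemptiness, downward closure and the exchange property."""
    ind = set(ind)
    if frozenset() not in ind:
        return False
    if any(I - {x} not in ind for I in ind for x in I):
        return False
    return all(any(I | {x} in ind for x in J - I)
               for I in ind for J in ind if len(I) < len(J))

ground = [1, 2, 3, 4, 5]
blocks, caps = [frozenset({1, 2, 3}), frozenset({3, 4, 5})], [2, 2]
all_sets = [frozenset(c) for r in range(6) for c in combinations(ground, r)]
general = [I for I in all_sets if almost_partition_independent(I, blocks, caps)]
plain = [I for I in all_sets if plain_independent(I, blocks, caps)]
print(is_matroid(general), is_matroid(plain))
```

For $J = \{1, 2\}$ the bound is $5 - (1 + 1) = 3 = u_1 + u_2 - |A_1 \cap A_2|$, which matches the $k = 2$ definition.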
A generalization of partition matroids
More generally:
- $f : 2^{[k]} \to \mathbb{Z}$, $f(J) = |A(J)| - \sum_{j \in J} (|A_j| - u_j)$, $\forall J \subseteq [k]$.
- $\mathrm{Ind} = \{I : |I \cap A(J)| \le f(J),\ \forall J \subseteq [k]\}$. Then $(V, \mathrm{Ind})$ is a matroid (if nonempty).
Proof technique: uncrossing argument.
- For a set $I$, define $T(I)$ to be the set of tight constraints: $T(I) = \{J \subseteq [k] : |I \cap A(J)| = f(J)\}$.
- $\forall I \in \mathrm{Ind}$ and $J_1, J_2 \in T(I)$: either $J_1 \cup J_2 \in T(I)$ or $J_1 \cap J_2 = \emptyset$.
- Hence $\mathrm{Ind}$ is the family of independent sets of a matroid.

A generalization of almost partition matroids
- $f : 2^{[k]} \to \mathbb{Z}$, $f(J) = |A(J)| - \sum_{j \in J} (|A_j| - u_j)$, $\forall J \subseteq [k]$; $u_i \le |A_i| - 1$.
- Note: this requires $k \le n$ (for $k > n$, $f$ becomes negative). But we want $k = n^{\log\log n}$, so do some sort of truncation to allow $k \gg n$.
- $f$ is $(\mu, \tau)$-good if $f(J) \ge 0$ for $J \subseteq [k]$ with $|J| \le \tau$, and $f(J) \ge \mu$ for $J \subseteq [k]$ with $\tau \le |J| \le 2\tau - 2$.
- $h(J) = f(J)$ if $|J| \le \tau$, and $h(J) = \mu$ otherwise.
- $\mathrm{Ind} = \{I : |I \cap A(J)| \le h(J),\ \forall J \subseteq [k]\}$. Then $(V, \mathrm{Ind})$ is a matroid (if nonempty).

A generalization of partition matroids
- Let $L = n^{\log\log n}$. Let $A_1, A_2, \dots, A_L$ be random subsets of $V$ ($A_i$ includes each element of $V$ independently with probability $n^{-2/3}$).
- Let $\mu = n^{1/3} \log^2 n$, $u = \log^2 n$, $\tau = n^{1/3}$.
- Each subset $J \subseteq \{1, 2, \dots, L\}$ induces a matroid such that:
- for any $i \notin J$, $A_i$ is independent in this matroid, so $\mathrm{Rank}(A_i)$ is roughly $|A_i|$ (i.e., $\Theta(n^{1/3})$);
- the rank of the sets $A_j$, $j \in J$, is $u = \log^2 n$.
- [Figure: sets $A_1, A_2, A_3, \dots, A_L$ with $L = n^{\log\log n}$; each set has rank either High $= n^{1/3}$ or Low $= \log^2 n$.]

Product distributions, Matroid Rank Fns
Talagrand implies: let $D$ be a product distribution on $V$, $R = \mathrm{rank}(X)$, $X$ drawn from $D$.
- If $\mathbb{E}[R] \ge 4000$,
- If $\mathbb{E}[R] \le 500 \log(1/\epsilon)$,
Related work: [Chekuri, Vondrak '09] and [Vondrak '10] prove a slightly more general result by two different techniques.

Product distributions, Matroid Rank Fns
Talagrand implies: let $D$ be a product distribution on $V$, $R = \mathrm{rank}(X)$, $X$ drawn from $D$.
- If $\mathbb{E}[R] \ge 4000$,
- If $\mathbb{E}[R] \le 500 \log(1/\epsilon)$,
Algorithm:
- Let $\mu = \sum_{i=1}^{m} f(x_i) / m$.
- Let $g$ be the constant function with value $\mu$.
This achieves approximation factor $O(\log(1/\epsilon))$ on a $1 - \epsilon$ fraction of points, with high probability.

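The algorithm itself is a one-line estimator. The sketch below shows only the mechanics; the name `learn_constant` and the three sample values are made up for illustration, and nothing here checks the approximation guarantee, which depends on the concentration result.

```python
def learn_constant(samples):
    """Return the constant hypothesis g(x) = mean of the observed values."""
    mu = sum(value for _, value in samples) / len(samples)
    return lambda x: mu

# Hypothetical labeled examples (x_i, f(x_i)) with x_i in {0,1}^3.
samples = [((1, 0, 1), 2), ((1, 1, 1), 4), ((0, 1, 1), 3)]
g = learn_constant(samples)
print(g((0, 0, 0)))   # the empirical mean, regardless of the input
```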
Conclusions and Open Questions
- Analyze intrinsic learnability of submodular fns.
- Our analysis reveals interesting novel extremal and structural properties of submodular fns.
Open questions:
- Improve the $\Omega(n^{1/3})$ lower bound to $\Omega(n^{1/2})$.
- Non-monotone submodular functions.

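Other interesting structural properties
- Let $h : \mathbb{R} \to \mathbb{R}_+$ be concave, non-decreasing. For each $S \subseteq V$, let $f(S) = h(|S|)$.
- Claim: these functions $f$ are submodular, monotone, non-negative.
- [Figure: $f$ plotted over the subset lattice from $\emptyset$ to $V$.]
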
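Theorem: "Every submodular function looks like this", qualified on the slide by "lots of", "approximately" and "usually".
- [Figure: $f$ plotted over the subset lattice from $\emptyset$ to $V$.]
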
Theorem: "Every submodular function looks like this", qualified on the slide by "lots of", "approximately" and "usually".
- [Figure: $f$ plotted over the subset lattice from $\emptyset$ to $V$.]
Theorem: let $f$ be a non-negative, monotone, submodular, 1-Lipschitz function. For any $\epsilon > 0$, there exists a concave function $h : [0, n] \to \mathbb{R}$ such that for every $k \in [0, n]$, and for a $1 - \epsilon$ fraction of $S \subseteq V$ with $|S| = k$, we have $h(k) \le f(S) \le O(\log^2(1/\epsilon)) \cdot h(k)$.
- In fact, $h(k)$ is just $\mathbb{E}[f(S)]$, where $S$ is uniform on sets of size $k$.
- Proof: based on Talagrand's inequality.

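The averaged function $h(k) = \mathbb{E}[f(S)]$ can be computed exactly for a small example. This sketch uses a partition matroid rank function on four elements (blocks $\{0,1\}$ and $\{2,3\}$, one element allowed per block) as an illustrative 1-Lipschitz target; the helper names are assumptions, and the example only shows that $h$ has decreasing increments here, not the concentration statement.

```python
from fractions import Fraction
from itertools import combinations

def average_by_size(f, n):
    """h[k] = exact average of f over all subsets of {0..n-1} of size k."""
    h = []
    for k in range(n + 1):
        values = [f(frozenset(c)) for c in combinations(range(n), k)]
        h.append(Fraction(sum(values), len(values)))
    return h

def partition_rank(S):
    """Rank in the partition matroid with blocks {0,1}, {2,3} and caps 1, 1."""
    return min(len(S & {0, 1}), 1) + min(len(S & {2, 3}), 1)

h = average_by_size(partition_rank, 4)
increments = [h[k + 1] - h[k] for k in range(4)]
print(h)            # values 0, 1, 5/3, 2, 2
print(increments)   # 1, 2/3, 1/3, 0: non-increasing, so h is concave on 0..4
```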
Conclusions and Open Questions
- Analyze intrinsic learnability of submodular fns.
- Our analysis reveals interesting novel extremal and structural properties of submodular fns.
Open questions:
- Improve the $\Omega(n^{1/3})$ lower bound to $\Omega(n^{1/2})$.
- Non-monotone submodular functions: any algorithm? A lower bound better than $\Omega(n^{1/3})$?