Learning Submodular Functions Nick Harvey, Waterloo C&O Joint work with Nina Balcan, Georgia Tech

Computational Learning Theory (Training Phase)
Distribution D on {0,1}^n; target function f : {0,1}^n → {0,1}.
– The algorithm sees examples (x_1, f(x_1)), …, (x_m, f(x_m)), where the x_i's are i.i.d. from distribution D.
– The algorithm produces a "hypothesis" g : {0,1}^n → {0,1}. (Hopefully g ≈ f.)

Computational Learning Theory (Testing Phase)
Distribution D on {0,1}^n; target f : {0,1}^n → {0,1}; hypothesis g : {0,1}^n → {0,1}.
– The algorithm sees i.i.d. examples (x_1, f(x_1)), …, (x_m, f(x_m)) from D and produces a hypothesis g.
– At test time, a fresh x is drawn from D and we ask: is f(x) = g(x)?
– Goal: Pr_{x_1,…,x_m}[ Pr_x[ f(x) = g(x) ] ≥ 1 − ε ] ≥ 1 − δ.
– Such an algorithm is "Probably Approximately Correct" — here, "Probably Mostly Correct".

Computational Learning Theory
The Probably Mostly Correct model is impossible if f is arbitrary (pure random noise, or simply too unstructured) and the number of training points is ≪ 2^n.

Computational Learning Theory
Learning is possible if f is structured: e.g., a k-CNF formula, an intersection of halfspaces in R^k, constant-depth Boolean circuits, …

Our Model
Distribution D on {0,1}^n; target function f : {0,1}^n → R_+.
– The algorithm sees examples (x_1, f(x_1)), …, (x_m, f(x_m)), where the x_i's are i.i.d. from distribution D.
– The algorithm produces a "hypothesis" g : {0,1}^n → R_+. (Hopefully g ≈ f.)

Our Model
Distribution D on {0,1}^n; target f : {0,1}^n → R_+.
– The algorithm sees examples (x_1, f(x_1)), …, (x_m, f(x_m)), with the x_i's i.i.d. from D, and produces a hypothesis g : {0,1}^n → R_+.
– At test time we ask: is f(x) ≈ g(x)?
– Goal: Pr_{x_1,…,x_m}[ Pr_x[ g(x) ≤ f(x) ≤ α · g(x) ] ≥ 1 − ε ] ≥ 1 − δ.
– We call such an algorithm "Probably Mostly Approximately Correct" (PMAC).
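To make the inner "mostly approximately correct" event concrete, here is a minimal sketch that estimates Pr_x[ g(x) ≤ f(x) ≤ α·g(x) ] by sampling; f, g, and sample_D are hypothetical callables standing in for the target, the hypothesis, and the distribution D. The PMAC guarantee asks that this quantity be at least 1 − ε with probability at least 1 − δ over the training sample.

```python
def pmac_accuracy(f, g, alpha, sample_D, m=10000):
    """Empirical estimate of Pr_x[ g(x) <= f(x) <= alpha * g(x) ]
    over points x drawn by calling sample_D()."""
    hits = 0
    for _ in range(m):
        x = sample_D()
        hits += g(x) <= f(x) <= alpha * g(x)
    return hits / m
```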

Our Model
"Probably Mostly Approximately Correct" is impossible if f is arbitrary and the number of training points is ≪ 2^n. Can we learn f if it has "nice structure"?
– Linear functions: trivial.
– Convex functions: meaningless, since the domain is {0,1}^n.
– Submodular functions: closely related to convexity.

Functions We Study
– Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) for all S, T ⊆ V
– Monotonicity: f(S) ≤ f(T) for all S ⊆ T
– Non-negativity: f(S) ≥ 0 for all S ⊆ V
Examples:
– Matroids: f(S) = rank_M(S), where M is a matroid.
– Concave functions: let h : R → R be concave; for each S ⊆ V, let f(S) = h(|S|).
– Wireless base stations: clients want to receive data, but not too much data, and they cannot receive data if the base station sends too much data at once. [Chekuri '10]

Example: Concave Functions
Let h : R → R be concave. (Figure: graph of a concave function h.)

Example: Concave Functions
Let h : R → R be concave. For each S ⊆ V, let f(S) = h(|S|). (Figure: f(S) plotted against |S|, from ∅ up to V.)
Claim: f is submodular. We prove a partial converse.
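As a sanity check of the claim, here is a brute-force verification on a small ground set that f(S) = h(|S|) is non-negative, monotone, and submodular when h is concave and non-decreasing. This is a minimal sketch; the helper names are my own.

```python
import math
from itertools import combinations

def all_subsets(V):
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def is_monotone_submodular(f, V, tol=1e-9):
    """Brute-force check of non-negativity, monotonicity, and submodularity of f on 2^V."""
    P = all_subsets(V)
    for S in P:
        if f(S) < -tol:
            return False
        for T in P:
            if S <= T and f(S) > f(T) + tol:                 # monotonicity
                return False
            if f(S) + f(T) < f(S & T) + f(S | T) - tol:      # submodularity
                return False
    return True

V = frozenset(range(6))
f = lambda S: math.sqrt(len(S))       # h(x) = sqrt(x) is concave and non-decreasing
print(is_monotone_submodular(f, V))   # True
```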

Theorem (informal): Every submodular function looks like this — with lots of caveats: approximately, and only for most sets. (Figure: a concave-like profile over sets from ∅ to V.)

Theorem (informal): Every submodular function looks like this — approximately, and for most sets.
Theorem: Let f be a non-negative, monotone, submodular, 1-Lipschitz function (e.g., a matroid rank function). Then there exists a concave function h : [0,n] → R such that, for any ε > 0, for every k ∈ [0,n], and for a 1−ε fraction of the sets S ⊆ V with |S| = k, we have
  h(k) ≤ f(S) ≤ O(log^2(1/ε)) · h(k).
In fact, h(k) is just E[ f(S) ], where S is uniform over the sets of size k.
Proof: based on Talagrand's inequality.

Learning Submodular Functions under any product distribution
Product distribution D on {0,1}^n; target f : {0,1}^n → R_+.
Algorithm:
– Let μ = (1/m) Σ_{i=1}^m f(x_i).
– Let g be the constant function with value μ.
This achieves approximation factor O(log^2(1/ε)) on a 1−ε fraction of points, with high probability.
Proof: essentially follows from the previous theorem.
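The algorithm really is just "output the empirical mean as a constant hypothesis". A minimal sketch, where the training examples are the (x_i, f(x_i)) pairs the learner is given:

```python
def learn_constant_hypothesis(training_examples):
    """training_examples: list of (x_i, f(x_i)) pairs drawn i.i.d. from a product distribution.
    Returns the constant hypothesis g(x) = mu, the empirical mean of the observed values."""
    mu = sum(value for _, value in training_examples) / len(training_examples)
    return lambda x: mu
```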

Learning Submodular Functions under an arbitrary distribution?
– The same argument no longer works: Talagrand's inequality requires a product distribution.
– Intuition: a non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.

Learning Submodular Functions under an arbitrary distribution?
– Intuition: a non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.
– Is there a lower bound? Can we create a submodular function with lots of deep "bumps"?

Start with f(S) = min{ |S|, k }: f(S) = |S| if |S| ≤ k, and f(S) = k otherwise.

One bump at a fixed set A of size k:
f(S) = |S| (if |S| < k), k−1 (if S = A), k (otherwise).

A family of bumps A = {A_1, …, A_m}, each |A_i| = k:
f(S) = |S| (if |S| < k), k−1 (if S ∈ A), k (otherwise).
Claim: f is submodular if |A_i ∩ A_j| ≤ k−2 for all i ≠ j.
In fact, f is the rank function of a "paving matroid" [Rota?].
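A direct transcription of this bump construction, as a sketch (whether it is actually a matroid rank function depends on the |A_i ∩ A_j| ≤ k−2 condition from the claim):

```python
def bump_rank(S, k, bumps):
    """The function from the slide: |S| below the threshold k, value k on large sets,
    but k-1 on each designated 'bump' set A_i (all of size k)."""
    S = frozenset(S)
    if len(S) < k:
        return len(S)
    return k - 1 if S in bumps else k

# Two bumps of size 3 intersecting in 1 element (<= k-2), so the claim applies.
bumps = {frozenset({1, 2, 3}), frozenset({3, 4, 5})}
print(bump_rank({1, 2, 3}, 3, bumps))   # 2  (a bump)
print(bump_rank({1, 2, 4}, 3, bumps))   # 3  (a non-bump of size k)
```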

Take A to be a constant-weight-k error-correcting code of (Hamming) distance 4, so that |A_i ∩ A_j| ≤ k−2 for all i ≠ j. Such a code can be very large, e.g., m = 2^n / n^4.
f(S) = |S| (if |S| < k), k−1 (if S ∈ A), k (otherwise).
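One concrete way to build such a family — my own illustration, not necessarily the construction intended on the slide — is to take all k-subsets of {0, …, n−1} whose elements sum to 0 mod n. Two distinct k-sets intersecting in k−1 elements would differ in a single element yet have equal sums mod n, which is impossible, so every pairwise intersection has size at most k−2.

```python
from itertools import combinations

def constant_weight_distance4_code(n, k):
    """All k-subsets of {0,...,n-1} whose element sum is 0 mod n.
    Any two distinct members intersect in at most k-2 elements,
    i.e. their indicator vectors are at Hamming distance >= 4."""
    return [frozenset(S) for S in combinations(range(n), k) if sum(S) % n == 0]

code = constant_weight_distance4_code(n=10, k=4)
print(len(code))   # roughly C(10,4)/10 sets
assert all(len(A & B) <= 2 for A in code for B in code if A != B)
```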

Now delete half of the bumps at random:
f(S) = |S| (if |S| < k), k−1 (if S ∈ A and A wasn't deleted), k (otherwise).
Then f is very unconcentrated on A, so any algorithm to learn f has additive error 1: if the algorithm only sees examples away from the deleted bumps, then f cannot be predicted on them.

View as a Reduction
Suppose we have a map ρ : 2^T → 2^V from a domain 2^T into 2^V, where the submodular function f̃ lives.

View as a Reduction
Suppose we have a map ρ : 2^T → 2^V and values α < β such that, for every Boolean function f : 2^T → {0,1}, there is a submodular function f̃ : 2^V → R_+ with
– f(S) = 0 ⇒ f̃(ρ(S)) = α (a "bump")
– f(S) = 1 ⇒ f̃(ρ(S)) = β (a "non-bump").
Claim: If f cannot be learned, then any algorithm for learning f̃ must have error β/α (under the uniform distribution on A = ρ(2^T)).

View as a Reduction
By paving matroids: such a map ρ exists with |V| = n, |T| = Ω(n), α = n/2, and β = n/2 + 1.
⇒ additive error 1 for learning submodular functions.

Can we force a bigger error with bigger bumps? Yes! We need A to be an extremely strong error-correcting code.

Expander Graphs
Let G = (U ∪ V, E) be a bipartite graph.
Definition: G is a γ-expander (a vertex expander) if |Γ(S)| ≥ γ · |S| for all S ⊆ U, where Γ(S) = { v : ∃ u ∈ S s.t. {u,v} ∈ E }.
That is, every set S ⊆ U has at least γ · |S| neighbours.
(Example: a graph with a perfect matching is a 1-expander.)

Expander Graphs
Let G = (U ∪ V, E) be a bipartite graph.
Revised Definition: G is a (K, γ)-expander (a vertex expander) if |Γ(S)| ≥ γ · |S| for all S ⊆ U with |S| ≤ K, where Γ(S) = { v : ∃ u ∈ S s.t. {u,v} ∈ E }.
That is, every small set S ⊆ U has at least γ · |S| neighbours.

Probabilistic Construction
Recall: G is a (K, γ)-expander if |Γ(S)| ≥ γ · |S| for all S ⊆ U with |S| ≤ K.
Theorem [Folklore]: There exists a graph G = (U ∪ V, E) such that
– G is a (K, γ)-expander
– G is D-regular
– γ = (1 − ε) · D  (since γ ≤ D always, this is nearly the best possible)
– D = O( log(|U|/|V|) / ε )
– |V| = O( K · D / ε )
Such a G is called a lossless expander.
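A minimal sketch of the probabilistic construction: each left vertex picks D random right neighbours, and we spot-check expansion on random small sets. This is only an illustration of why small random sets expand by nearly D, not a verification of the worst case; all parameter values below are my own.

```python
import random

def random_bipartite(U_size, V_size, D, seed=0):
    """Each left vertex independently picks D distinct right neighbours."""
    rng = random.Random(seed)
    return [rng.sample(range(V_size), D) for _ in range(U_size)]

def expansion(G, S):
    """|Gamma(S)| / |S| for a set S of left vertices."""
    neighbours = set()
    for u in S:
        neighbours.update(G[u])
    return len(neighbours) / len(S)

G = random_bipartite(U_size=2000, V_size=400, D=16)
K = 10
rng = random.Random(1)
samples = [rng.sample(range(2000), s) for s in range(1, K + 1) for _ in range(100)]
print(min(expansion(G, S) for S in samples))   # typically close to the degree D = 16
```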

Probabilistic Construction
Recall: G is a (K, γ)-expander if |Γ(S)| ≥ γ · |S| for all S ⊆ U with |S| ≤ K.
Theorem [Folklore]: there exists a D-regular (K, γ)-expander G = (U ∪ V, E) with γ = (1 − ε) · D, D = O( log(|U|/|V|) / ε ), and |V| = O( K · D / ε ).
Common parameters: |U| = 1.3n, |V| = n, K = n/500, γ = 20, D = 32.
Such graphs can be constructed explicitly (with constants perhaps slightly worse). [Capalbo, Reingold, Vadhan, Wigderson]

Probabilistic Construction
Recall: G is a (K, γ)-expander if |Γ(S)| ≥ γ · |S| for all S ⊆ U with |S| ≤ K.
Theorem [Folklore]: there exists a D-regular (K, γ)-expander G = (U ∪ V, E) with γ = (1 − ε) · D, D = O( log(|U|/|V|) / ε ), and |V| = O( K · D / ε ).
Our parameters: |U| = n^{log n}, |V| = n, K = n^{1/3}, γ = D − log^2 n, D = n^{1/3} · log^2 n, ε = 1/n^{1/3}.
Very unbalanced, very large degree, very large expansion — no explicit construction is known in this regime.

Constructing Bumps from Expanders
We need a map ρ : 2^T → 2^V and α < β such that, for every f : 2^T → {0,1}, there is a submodular f̃ : 2^V → R_+ with f(S) = 0 ⇒ f̃(ρ(S)) = α and f(S) = 1 ⇒ f̃(ρ(S)) = β.

Constructing Bumps from Expanders
We need a map ρ : 2^T → 2^V and α < β such that, for every f : 2^T → {0,1}, there is a submodular f̃ : 2^V → R_+ with f(S) = 0 ⇒ f̃(ρ(S)) = α and f(S) = 1 ⇒ f̃(ρ(S)) = β.
Use an expander whose left vertex set is U = 2^T. For S ∈ 2^T, set ρ(S) = …

Constructing Bumps from Expanders
… for S ∈ 2^T, set ρ(S) = Γ(S), the neighbourhood of S in the expander.
Theorem: Using expanders, such a map ρ exists with |V| = n, |T| = Ω(log^2 n), α = log^2 n, and β = n^{1/3}.
Corollary: Learning submodular functions has error Ω(n^{1/3}).

What are these matroids?
– For every S ⊆ T with f(S) = 0, define A_S = Γ(S).
– Define I = { I : |I ∩ A_S| ≤ α for all S }. Is this a matroid?
No! If the A_S's are disjoint, this is a partition matroid. But if they overlap, there is an easy counterexample: let X = {a,b}, Y = {b,c}, and I = { I : |I ∩ X| ≤ 1 and |I ∩ Y| ≤ 1 }. Then {a,c} and {b} are both maximal sets in I, so this is not a matroid.

What are these matroids?
– For every S ⊆ T with f(S) = 0, define A_S = Γ(S).
– Define I = { I : |I ∩ A_S| ≤ α for all S }. This is not a matroid in general; if the A_S's are disjoint it is a partition matroid.
But! If the A_S's are almost disjoint, then I is almost a matroid. Define instead
  I = { I : |I ∩ ∪_{S∈𝒯} A_S| ≤ |𝒯| · α + |∪_{S∈𝒯} A_S| − Σ_{S∈𝒯} |A_S|  for all 𝒯 ⊆ 2^T }.
(The correction term |∪_{S∈𝒯} A_S| − Σ_{S∈𝒯} |A_S| = |Γ(𝒯)| − Σ_{S∈𝒯} |Γ(S)| is ≈ 0, by expansion.)
Thm: (V, I) is a matroid. This generalizes partition matroids, laminar matroids, paving matroids, …
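A brute-force sketch of this independence test, written directly from the condition as reconstructed above (exponential in the number of bump sets, so only for tiny examples; the function name is my own):

```python
from itertools import combinations

def is_independent(I, A_sets, alpha):
    """Check: for every non-empty subfamily F of A_sets,
       |I ∩ union(F)| <= |F|*alpha + |union(F)| - sum(|A| for A in F)."""
    I = frozenset(I)
    A_sets = [frozenset(A) for A in A_sets]
    for r in range(1, len(A_sets) + 1):
        for F in combinations(A_sets, r):
            union = frozenset().union(*F)
            bound = r * alpha + len(union) - sum(len(A) for A in F)
            if len(I & union) > bound:
                return False
    return True

# With disjoint A_sets the correction term vanishes and this is the partition-matroid rule.
print(is_independent({1, 4}, [{1, 2, 3}, {4, 5, 6}], alpha=1))   # True
print(is_independent({1, 2}, [{1, 2, 3}, {4, 5, 6}], alpha=1))   # False
```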

A General Upper Bound?
Theorem (our lower bound): Any algorithm for learning a submodular function w.r.t. an arbitrary distribution must have approximation factor Ω(n^{1/3}).
If we're aiming for such a large approximation factor, surely there's an algorithm that can achieve it?

A General Upper Bound?
Theorem (our lower bound): Any algorithm for learning a submodular function w.r.t. an arbitrary distribution must have approximation factor Ω(n^{1/3}).
Theorem (our upper bound): There is an algorithm for learning a submodular function w.r.t. an arbitrary distribution with approximation factor O(n^{1/2}).

Computing Linear Separators
Given {+,−}-labeled points in R^n, find a hyperplane c^T x = b that separates the +s from the −s. Easily solved by linear programming.

Learning Linear Separators
Given a random sample of {+,−}-labeled points in R^n, find a hyperplane c^T x = b that separates most of the +s from the −s (a few errors are allowed). A classic machine learning problem.

Learning Linear Separators
Classic Theorem [Vapnik–Chervonenkis 1971]: Õ(n/ε^2) samples suffice to get error ε.

Submodular Functions are Approximately Linear
Let f be non-negative, monotone, and submodular:
– Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) for all S, T ⊆ V
– Monotonicity: f(S) ≤ f(T) for all S ⊆ T
– Non-negativity: f(S) ≥ 0 for all S ⊆ V
Claim: f can be approximated to within a factor n by a linear function g.
Proof sketch: let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n · f(S).
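The factor-n linear approximation is just the sum of singleton values. A minimal sketch; the coverage-function example below is my own.

```python
def linear_approximation(f, V):
    """g(S) = sum over s in S of f({s}); for non-negative monotone submodular f,
    f(S) <= g(S) <= n * f(S)."""
    w = {v: f(frozenset({v})) for v in V}
    return lambda S: sum(w[v] for v in S)

# Example: a coverage function f(S) = |union of the sets indexed by S|.
universe_sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
f = lambda S: len(set().union(*(universe_sets[v] for v in S))) if S else 0
g = linear_approximation(f, universe_sets.keys())
S = {0, 1, 2}
print(f(S), g(S))   # 5 7  -- indeed f(S) <= g(S) <= 3 * f(S)
```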

Submodular Functions are Approximately Linear
(Figure: the linear function g lies between f and n·f over all of 2^V.)

Algorithm:
– Randomly sample {S_1, …, S_k} from the distribution.
– Create a '+'-labeled point for each value f(S_i) and a '−'-labeled point for each n·f(S_i).
– Now just learn a linear separator! (Figure: g separating the + points from the − points, between f and n·f.)
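A hedged sketch of this step: one way to implement "find a linear function lying between the + points and the − points" is a feasibility LP over the sampled sets. This is only an illustration (assuming scipy), not the paper's exact algorithm, which improves the factor to O(n^{1/2}) via an ellipsoid approximation.

```python
import numpy as np
from scipy.optimize import linprog

def fit_sandwich_lp(X, y, n):
    """Find w >= 0 with y_i <= w . x_i <= n * y_i for every sample,
    where x_i in {0,1}^n is the indicator vector of S_i and y_i = f(S_i).
    Feasible whenever f is non-negative, monotone and submodular
    (the vector of singleton values f({s}) is one solution)."""
    A_ub = np.vstack([-X, X])                  # encodes  w.x_i >= y_i  and  w.x_i <= n*y_i
    b_ub = np.concatenate([-y, n * y])
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * X.shape[1], method="highs")
    return res.x if res.status == 0 else None
```

The returned w defines g(S) = w · χ(S), which sandwiches f on the training sets; the theorem on the next slide says such a g also sandwiches f, within a factor n, on most of the distribution.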

Theorem: g approximates f to within a factor n on a 1−ε fraction of the distribution.
This can be improved to factor O(n^{1/2}) by approximating the submodular polyhedron by its minimum-volume ellipsoid.

Summary
– PMAC model for learning real-valued functions.
– Learning under arbitrary distributions:
  – Factor O(n^{1/2}) algorithm
  – Factor Ω(n^{1/3}) hardness (information-theoretic)
– Learning under product distributions:
  – Factor O(log(1/ε)) algorithm
– New general family of matroids:
  – Generalizes partition matroids to non-disjoint parts

Open Questions
– Improve the Ω(n^{1/3}) lower bound to Ω(n^{1/2}).
– Explicit construction of the required expanders.
– Non-monotone submodular functions:
  – Any algorithm?
  – A lower bound better than Ω(n^{1/3})?
– For the algorithm under the uniform distribution, relax the 1-Lipschitz condition.