Generalization and adaptivity in stochastic convex optimization


1 Generalization and adaptivity in stochastic convex optimization
Vitaly Feldman, IBM Research - Almaden. TOCA-SV 2016

2 Stochastic convex optimization
Convex body πΎβŠ† ℝ 𝑑 Class 𝐹 of convex functions over 𝐾 Given 𝑓 1 ,…, 𝑓 𝑛 sampled i.i.d. from 𝐷 over 𝐹 Minimize 𝑓 𝐷 ≐ 𝐄 π‘“βˆΌπ· [𝑓] over 𝐾: Find π‘₯ s.t. 𝑓 𝐷 π‘₯ ≀ min π‘₯∈𝐾 𝑓 𝐷 π‘₯ +πœ– with high prob. 𝑓 1 𝑓 2 … 𝑓 𝑛 In SCO the goal is to optimize the (pointwise) expected function from samples. D and hence f_D is unknown 𝑓 𝐷 π‘₯

3 Applications
Machine learning and statistics. Supervised learning: dataset $(z_1, y_1), \dots, (z_n, y_n) \sim D$; hypotheses $H = \{h_x \mid x \in K\}$; minimize the true loss $\mathbf{E}_{(z,y) \sim D}[L(h_x(z), y)]$. Set $f_i(x) = L(h_x(z_i), y_i)$; then $f_D(x) = \mathbf{E}_{(z,y) \sim D}[L(h_x(z), y)]$ is the true loss of $h_x$. Also unsupervised learning, density estimation, etc.
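For instance, with linear predictors and the squared loss (an assumed concrete choice for illustration), the reduction to SCO looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 100
# Hypothetical dataset (z_i, y_i) ~ D: here y is a noisy linear function of z.
Z = rng.normal(size=(n, d))
y = Z @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Hypotheses H = {h_x : z -> <x, z> | x in K}; loss L(p, y) = (p - y)^2.
def f_i(x, i):
    """The i-th sampled convex function: f_i(x) = L(h_x(z_i), y_i)."""
    return (Z[i] @ x - y[i]) ** 2

# f_D(x) = E_{(z,y)~D} L(h_x(z), y) is the true loss of h_x; it is unknown,
# and only the n sampled functions f_1,...,f_n are available.
```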

4 ERM
Empirical risk minimization (ERM), i.e. stochastic average approximation: given $S = (f_1, \dots, f_n)$, define the empirical average $f_S(x) \doteq \frac{1}{n}\sum_{i \in [n]} f_i(x)$ and minimize $f_S$ over $K$: $\hat{x} \in \operatorname{argmin}_{x \in K} f_S(x)$. Does ERM generalize, i.e. what is $f_D(\hat{x})$?
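One generic way to compute an ERM solution numerically (a sketch on the same illustrative regression instance as above; SLSQP is just one of many ways to handle the ball constraint, not part of the talk):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 10, 100
Z = rng.normal(size=(n, d))
y = Z @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f_S(x):                                  # empirical average of the sampled losses
    return np.mean((Z @ x - y) ** 2)

# ERM: minimize f_S over K = {x : ||x||_2 <= 1}
res = minimize(f_S, x0=np.zeros(d), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: 1.0 - x @ x}])
x_hat = res.x
print("empirical risk f_S(x_hat):", f_S(x_hat))
# The slide's question: how small is the *population* risk f_D(x_hat)?
```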

5 Uniform convergence
$\forall x \in K$, $|f_S(x) - f_D(x)| \le \epsilon$. If $f_S(\hat{x}) \le \min_{x \in K} f_S(x) + \alpha$, then $f_D(\hat{x}) \le \min_{x \in K} f_D(x) + \alpha + 2\epsilon$. Generalization guarantees for ERM are usually based on uniform convergence. For many problems the number of samples necessary for learning is sufficient for uniform convergence (up to constant factors), and this is often known to be optimal (e.g. via the VC dimension).
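For completeness, the implication follows from a short chain of inequalities: write $x^* \in \operatorname{argmin}_{x \in K} f_D(x)$ and apply uniform convergence at $\hat{x}$ and at $x^*$:

$$f_D(\hat{x}) \;\le\; f_S(\hat{x}) + \epsilon \;\le\; \min_{x \in K} f_S(x) + \alpha + \epsilon \;\le\; f_S(x^*) + \alpha + \epsilon \;\le\; f_D(x^*) + \alpha + 2\epsilon \;=\; \min_{x \in K} f_D(x) + \alpha + 2\epsilon.$$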

6 ERM for SCO: the gap
$\ell_2$ Lipschitz-bounded SCO: $K \subseteq \mathbb{B}_2^d(1) \doteq \{x : \|x\|_2 \le 1\}$ and $F = \{f \text{ convex} : \forall x, y \in K,\ |f(x) - f(y)| \le \|x - y\|_2\}$. ERM upper bounds: $n = O(d/\epsilon^2)$, i.e. $\epsilon \approx \sqrt{d/n}$, via uniform convergence. The problem can be solved using $n = O(1/\epsilon^2)$ samples via stochastic gradient descent [Robbins, Monro '51; Polyak '90].

7 ERM lower bounds Can the ERM upper bounds be improved?
Linear models: $n = O(1/\epsilon^2)$ via uniform convergence [Kakade, Sridharan, Tewari '08]. Strongly convex: $n = O(1/\epsilon^2)$ via stability [Shalev-Shwartz, Shamir, Sridharan, Srebro '09]. ERM might require $n = \Omega(\log(d)/\epsilon)$ samples [S5 '09]. ERM might require $n = \Omega(d/\epsilon)$ samples [F '16].

8 Construction
$g: \mathbb{B}_2^d(1) \to [0,1]$ with an exponential number of independent maxima. $W$ is the set of maxima, with $|W| = 2^{d/6}$ and $\forall w \in W$, $g(w) = \frac{1}{8}$. For $V \subseteq W$, $g_V$ is the variant whose set of maxima is $V$: $\forall v \in V$, $g_V(v) = \frac{1}{8}$, and $\forall w \in W \setminus V$, $g_V(w) = 0$.

9 Distribution
$D$: $g_V$ where $V$ is a random subset of $W$; then $f_D = g/2$. If $n \le d/6$, then with probability $\ge 1/2$ there exists $w \in W$ such that $f_S(w) = 0$ and $f_D(w) = 1/16$. Can be made efficient: $D$ supported over $d$ functions computable in $O(d)$ time [F '16].
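A small simulation of this counting argument (my own sketch; it only models the values of the sampled functions on the set $W$, treating membership of each maximum in a random $V_i$ as an independent coin flip):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 36                       # small dimension for the demo (hypothetical value)
m = 2 ** (d // 6)            # |W| = 2^{d/6} = 64 maxima
n = d // 6                   # n <= d/6 samples

trials = 10_000
gap_found = 0
for _ in range(trials):
    # membership[i, j] = 1 iff maximum w_j lies in the random subset V_i
    membership = rng.integers(0, 2, size=(n, m))
    # empirical value at w_j: f_S(w_j) = (1/n) sum_i g_{V_i}(w_j), where g_{V_i}(w_j) = membership/8
    f_S = membership.mean(axis=0) / 8.0
    # f_D(w_j) = 1/16 for every maximum; look for a w that the empirical objective rates as 0
    if np.any(f_S == 0.0):
        gap_found += 1

print(f"P[exists w in W: f_S(w) = 0 while f_D(w) = 1/16] ~= {gap_found / trials:.3f}")
```

With these values the probability is $1 - (1 - 2^{-n})^{|W|} \approx 0.64 \ge 1/2$: some maximum looks perfect empirically ($f_S(w) = 0$) even though its population value is $f_D(w) = 1/16$.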

10 Algorithms & generalization
Which optimization algorithms generalize well? Does gradient descent on $f_S$ generalize? Projected gradient descent: initialize $x_1 \in K$; for $t = 1$ to $T$, $x_{t+1} = \Pi_K(x_t - \eta \nabla f_S(x_t))$, where $\nabla f_S(x_t) = \frac{1}{n}\sum_i \nabla f_i(x_t)$; output $\frac{1}{T}\sum_t x_t$. For a fixed $x$: $\|\frac{1}{n}\sum_i \nabla f_i(x) - \nabla f_D(x)\|_2 \approx \frac{1}{\sqrt{n}}$ with high probability, but $x_t$ depends on $S$! The points at which the gradient is estimated are chosen adaptively, and therefore we cannot rely on the independence that is necessary for concentration results. This type of problem is addressed by adaptive data analysis.
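A minimal sketch of the full-batch variant being asked about (an assumed squared-loss instance, not the talk's construction); unlike one-pass SGD, every iteration reuses the entire sample, so the query points $x_t$ depend on $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T, eta = 20, 200, 500, 0.05
Z = rng.normal(size=(n, d))
y = Z @ rng.normal(size=d)

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def grad_f_S(x):
    # (1/n) * sum_i grad f_i(x) for the squared loss f_i(x) = (<z_i, x> - y_i)^2
    return 2.0 * Z.T @ (Z @ x - y) / n

x = np.zeros(d)                       # x_1 in K
iterates = []
for t in range(T):
    # Adaptive data reuse: the gradient of f_S is re-estimated from the SAME
    # sample S at a point x_t that was itself computed from S.
    x = project_ball(x - eta * grad_f_S(x))
    iterates.append(x)
x_bar = np.mean(iterates, axis=0)     # output (1/T) * sum_t x_t
```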

11 Adaptive data analysis
[Diagram: dataset $S$; analyses $A_1, A_2, \dots, A_m$ return values $v_1, v_2, \dots, v_m$ to the data analyst(s), who choose each next analysis based on the previous values.] Data analysis consists of many steps in which the analysis at each step depends on the results of previous steps. Ensuring generalization is hard even if each individual analysis generalizes when executed on independent data points. How to ensure generalization? Differential privacy / information-theoretic stability [DFHPRR '14; BNSSSU '15; BF '16; RRTWX '16]; information bounds [DFHPRR '15; RZ '16; RRST '16]; lower bounds [HU '14; SU '15].
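To illustrate the difficulty (a standard textbook-style example, not taken from the talk): each query below generalizes in isolation, yet a query chosen based on previous answers can overfit badly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 50

# Data in which every feature is pure noise, independent of the label.
X = rng.choice([-1.0, 1.0], size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Queries 1..d: empirical correlation of each feature with the label.
# Each of these, viewed in isolation, concentrates around its true value 0.
answers = X.T @ y / n

# Query d+1 is chosen adaptively, based on all previous answers: combine the
# features using the signs of their empirical correlations, then ask for the
# empirical correlation of the combined feature.
combo = X @ np.sign(answers)
combo /= np.linalg.norm(combo) / np.sqrt(n)      # rescale to unit empirical second moment
empirical = combo @ y / n

print(f"empirical correlation of the adaptively built feature: {empirical:.3f}")
print("its true correlation with the label is 0 (the label is independent of X)")
```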

12 Summary
Cannot rely on uniform convergence/ERM for high-dimensional SCO. The choice of optimization algorithm affects generalization. Need better algorithms and analysis tools for dealing with adaptivity.

