Generalization and adaptivity in stochastic convex optimization


1 Generalization and adaptivity in stochastic convex optimization
Vitaly Feldman, IBM Research - Almaden. TOCA-SV 2016

2 Stochastic convex optimization
Convex body πΎβŠ† ℝ 𝑑 Class 𝐹 of convex functions over 𝐾 Given 𝑓 1 ,…, 𝑓 𝑛 sampled i.i.d. from 𝐷 over 𝐹 Minimize 𝑓 𝐷 ≐ 𝐄 π‘“βˆΌπ· [𝑓] over 𝐾: Find π‘₯ s.t. 𝑓 𝐷 π‘₯ ≀ min π‘₯∈𝐾 𝑓 𝐷 π‘₯ +πœ– with high prob. 𝑓 1 𝑓 2 … 𝑓 𝑛 In SCO the goal is to optimize the (pointwise) expected function from samples. D and hence f_D is unknown 𝑓 𝐷 π‘₯

3 Applications
Machine learning and statistics. Supervised learning: dataset $(z_1, y_1), \dots, (z_n, y_n) \sim D$; hypotheses $H = \{h_x \mid x \in K\}$; minimize the true loss $\mathbf{E}_{(z,y) \sim D}[L(h_x(z), y)]$. Set $f_i(x) = L(h_x(z_i), y_i)$; then $f_D(x) = \mathbf{E}_{(z,y) \sim D}[L(h_x(z), y)]$ is the true loss of $h_x$. Also unsupervised learning, density estimation, etc.
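For instance, with linear predictors and the squared loss (an assumed concrete choice for illustration), the reduction to SCO looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 100
# Hypothetical dataset (z_i, y_i) ~ D: here y is a noisy linear function of z.
Z = rng.normal(size=(n, d))
y = Z @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Hypotheses H = {h_x : z -> <x, z> | x in K}; loss L(p, y) = (p - y)^2.
def f_i(x, i):
    """The i-th sampled convex function: f_i(x) = L(h_x(z_i), y_i)."""
    return (Z[i] @ x - y[i]) ** 2

# f_D(x) = E_{(z,y)~D} L(h_x(z), y) is the true loss of h_x; it is unknown,
# and only the n sampled functions f_1,...,f_n are available.
```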

4 ERM
Empirical risk minimization (ERM), i.e. stochastic average approximation: given $S = (f_1, \dots, f_n)$, define the empirical average $f_S(x) \doteq \frac{1}{n}\sum_{i \in [n]} f_i(x)$ and minimize $f_S$ over $K$: $\hat{x} \in \operatorname{argmin}_{x \in K} f_S(x)$. Does ERM generalize, i.e. what is $f_D(\hat{x})$?
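One generic way to compute an ERM solution numerically (a sketch on the same illustrative regression instance as above; SLSQP is just one of many ways to handle the ball constraint, not part of the talk):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 10, 100
Z = rng.normal(size=(n, d))
y = Z @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f_S(x):                                  # empirical average of the sampled losses
    return np.mean((Z @ x - y) ** 2)

# ERM: minimize f_S over K = {x : ||x||_2 <= 1}
res = minimize(f_S, x0=np.zeros(d), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: 1.0 - x @ x}])
x_hat = res.x
print("empirical risk f_S(x_hat):", f_S(x_hat))
# The slide's question: how small is the *population* risk f_D(x_hat)?
```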

5 Uniform convergence
$\forall x \in K$, $|f_S(x) - f_D(x)| \le \epsilon$. If $f_S(\hat{x}) \le \min_{x \in K} f_S(x) + \alpha$, then $f_D(\hat{x}) \le \min_{x \in K} f_D(x) + \alpha + 2\epsilon$. Generalization guarantees for ERM are usually based on uniform convergence. For many problems the number of samples necessary for learning is sufficient for uniform convergence (up to constant factors), and this is often known to be optimal (e.g. via the VC dimension).
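For completeness, the implication follows from a short chain of inequalities: write $x^* \in \operatorname{argmin}_{x \in K} f_D(x)$ and apply uniform convergence at $\hat{x}$ and at $x^*$:

$$f_D(\hat{x}) \;\le\; f_S(\hat{x}) + \epsilon \;\le\; \min_{x \in K} f_S(x) + \alpha + \epsilon \;\le\; f_S(x^*) + \alpha + \epsilon \;\le\; f_D(x^*) + \alpha + 2\epsilon \;=\; \min_{x \in K} f_D(x) + \alpha + 2\epsilon.$$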

6 ERM for SCO: the gap
$\ell_2$ Lipschitz-bounded SCO: $K \subseteq \mathbb{B}_2^d(1) \doteq \{x : \|x\|_2 \le 1\}$ and $F = \{f \text{ convex} : \forall x, y \in K,\ |f(x) - f(y)| \le \|x - y\|_2\}$. ERM upper bounds: $n = O(d/\epsilon^2)$, i.e. $\epsilon \approx \sqrt{d/n}$, via uniform convergence. The problem can be solved using $n = O(1/\epsilon^2)$ samples via stochastic gradient descent [Robbins, Monro '51; Polyak '90].

7 ERM lower bounds Can the ERM upper bounds be improved?
Linear models: $n = O(1/\epsilon^2)$ via uniform convergence [Kakade, Sridharan, Tewari '08]. Strongly convex: $n = O(1/\epsilon^2)$ via stability [Shalev-Shwartz, Shamir, Sridharan, Srebro '09]. ERM might require $n = \Omega(\log(d)/\epsilon)$ samples [S5 '09]. ERM might require $n = \Omega(d/\epsilon)$ samples [F '16].

8 Construction
$g: \mathbb{B}_2^d(1) \to [0,1]$ with an exponential number of independent maxima. $W$ is the set of maxima, with $|W| = 2^{d/6}$ and $\forall w \in W$, $g(w) = \frac{1}{8}$. For $V \subseteq W$, $g_V$ is the variant whose set of maxima is $V$: $\forall v \in V$, $g_V(v) = \frac{1}{8}$, and $\forall w \in W \setminus V$, $g_V(w) = 0$.

9 Distribution
$D$: $g_V$ where $V$ is a random subset of $W$; then $f_D = g/2$. If $n \le d/6$, then with probability $\ge 1/2$ there exists $w \in W$ such that $f_S(w) = 0$ and $f_D(w) = 1/16$. Can be made efficient: $D$ supported over $d$ functions computable in $O(d)$ time [F '16].
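A small simulation of this counting argument (my own sketch; it only models the values of the sampled functions on the set $W$, treating membership of each maximum in a random $V_i$ as an independent coin flip):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 36                       # small dimension for the demo (hypothetical value)
m = 2 ** (d // 6)            # |W| = 2^{d/6} = 64 maxima
n = d // 6                   # n <= d/6 samples

trials = 10_000
gap_found = 0
for _ in range(trials):
    # membership[i, j] = 1 iff maximum w_j lies in the random subset V_i
    membership = rng.integers(0, 2, size=(n, m))
    # empirical value at w_j: f_S(w_j) = (1/n) sum_i g_{V_i}(w_j), where g_{V_i}(w_j) = membership/8
    f_S = membership.mean(axis=0) / 8.0
    # f_D(w_j) = 1/16 for every maximum; look for a w that the empirical objective rates as 0
    if np.any(f_S == 0.0):
        gap_found += 1

print(f"P[exists w in W: f_S(w) = 0 while f_D(w) = 1/16] ~= {gap_found / trials:.3f}")
```

With these values the probability is $1 - (1 - 2^{-n})^{|W|} \approx 0.64 \ge 1/2$: some maximum looks perfect empirically ($f_S(w) = 0$) even though its population value is $f_D(w) = 1/16$.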

10 Algorithms & generalization
Which optimization algorithms generalize well? Does gradient descent on $f_S$ generalize? Projected gradient descent: initialize $x_1 \in K$; for $t = 1$ to $T$, $x_{t+1} = \Pi_K(x_t - \eta \nabla f_S(x_t))$, where $\nabla f_S(x_t) = \frac{1}{n}\sum_i \nabla f_i(x_t)$; output $\frac{1}{T}\sum_t x_t$. For a fixed $x$: $\|\frac{1}{n}\sum_i \nabla f_i(x) - \nabla f_D(x)\|_2 \approx \frac{1}{\sqrt{n}}$ with high probability, but $x_t$ depends on $S$! The points at which the gradient is estimated are chosen adaptively, and therefore we cannot rely on the independence that is necessary for concentration results. This type of problem is addressed by adaptive data analysis.
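A minimal sketch of the full-batch variant being asked about (an assumed squared-loss instance, not the talk's construction); unlike one-pass SGD, every iteration reuses the entire sample, so the query points $x_t$ depend on $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T, eta = 20, 200, 500, 0.05
Z = rng.normal(size=(n, d))
y = Z @ rng.normal(size=d)

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def grad_f_S(x):
    # (1/n) * sum_i grad f_i(x) for the squared loss f_i(x) = (<z_i, x> - y_i)^2
    return 2.0 * Z.T @ (Z @ x - y) / n

x = np.zeros(d)                       # x_1 in K
iterates = []
for t in range(T):
    # Adaptive data reuse: the gradient of f_S is re-estimated from the SAME
    # sample S at a point x_t that was itself computed from S.
    x = project_ball(x - eta * grad_f_S(x))
    iterates.append(x)
x_bar = np.mean(iterates, axis=0)     # output (1/T) * sum_t x_t
```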

11 Adaptive data analysis
[Diagram: dataset $S$; analyses $A_1, A_2, \dots, A_m$ return values $v_1, v_2, \dots, v_m$ to the data analyst(s), who choose each next analysis based on the previous values.] Data analysis consists of many steps in which the analysis at each step depends on the results of previous steps. Ensuring generalization is hard even if each individual analysis generalizes when executed on independent data points. How to ensure generalization? Differential privacy / information-theoretic stability [DFHPRR '14; BNSSSU '15; BF '16; RRTWX '16]; information bounds [DFHPRR '15; RZ '16; RRST '16]; lower bounds [HU '14; SU '15].
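To illustrate the difficulty (a standard textbook-style example, not taken from the talk): each query below generalizes in isolation, yet a query chosen based on previous answers can overfit badly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 50

# Data in which every feature is pure noise, independent of the label.
X = rng.choice([-1.0, 1.0], size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Queries 1..d: empirical correlation of each feature with the label.
# Each of these, viewed in isolation, concentrates around its true value 0.
answers = X.T @ y / n

# Query d+1 is chosen adaptively, based on all previous answers: combine the
# features using the signs of their empirical correlations, then ask for the
# empirical correlation of the combined feature.
combo = X @ np.sign(answers)
combo /= np.linalg.norm(combo) / np.sqrt(n)      # rescale to unit empirical second moment
empirical = combo @ y / n

print(f"empirical correlation of the adaptively built feature: {empirical:.3f}")
print("its true correlation with the label is 0 (the label is independent of X)")
```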

12 Summary
Cannot rely on uniform convergence/ERM for high-dimensional SCO. The choice of optimization algorithm affects generalization. Need better algorithms and analysis tools for dealing with adaptivity.

