1
Generalization and adaptivity in stochastic convex optimization
Vitaly Feldman (IBM Research - Almaden), TOCA-SV 2016
2
Stochastic convex optimization
Convex body $K \subseteq \mathbb{R}^d$; class $F$ of convex functions over $K$. Given $f_1, \ldots, f_n$ sampled i.i.d. from a distribution $D$ over $F$, minimize $f_D \doteq \mathbb{E}_{f \sim D}[f]$ over $K$: find $\bar{x}$ such that $f_D(\bar{x}) \le \min_{x \in K} f_D(x) + \epsilon$ with high probability. In SCO the goal is to optimize the (pointwise) expected function from samples; $D$, and hence $f_D$, is unknown.
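This setting can be made concrete with a minimal numerical sketch. The particular distribution over convex functions used here (absolute-error linear losses) and all names are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def sample_f():
    """Draw one convex function f(x) = |<a, x> - b| from a toy distribution D."""
    a = rng.normal(size=d)
    b = rng.normal()
    return lambda x: abs(a @ x - b)

# n i.i.d. samples f_1, ..., f_n from D; these are all the learner sees
n = 100
fs = [sample_f() for _ in range(n)]

# f_D(x) = E_{f~D}[f(x)] is unknown to the learner; here we can only
# approximate it with a large fresh Monte Carlo sample for evaluation.
def approx_f_D(x, m=10000):
    return float(np.mean([sample_f()(x) for _ in range(m)]))
```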
3
Applications
Machine learning and statistics. Supervised learning: dataset $(z_1, y_1), \ldots, (z_n, y_n) \sim D$; hypotheses $H = \{h_x \mid x \in K\}$; minimize the true loss $\mathbb{E}_{(z,y) \sim D}\big[L(h_x(z), y)\big]$. Each example induces $f_i(x) = L(h_x(z_i), y_i)$, and $f_D(x) = \mathbb{E}_{(z,y) \sim D}\big[L(h_x(z), y)\big]$ is the true loss of $h_x$. The same framework covers unsupervised learning, density estimation, etc.
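For instance, with linear predictors $h_x(z) = \langle x, z \rangle$ and the logistic loss (one possible choice of $H$ and $L$, not fixed by the slide), each labeled example turns into a convex function of $x$; a minimal sketch:

```python
import numpy as np

def logistic_loss(score, y):
    """L(h_x(z), y) for labels y in {-1, +1} and a linear predictor h_x(z) = <x, z>."""
    return np.log1p(np.exp(-y * score))

def make_f_i(z_i, y_i):
    """The data point (z_i, y_i) induces the convex function f_i(x) = L(h_x(z_i), y_i)."""
    return lambda x: logistic_loss(x @ z_i, y_i)
```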
4
ERM
Empirical risk minimization (ERM), also called sample average approximation. Given $S = (f_1, \ldots, f_n)$, define the empirical average $f_S(x) \doteq \frac{1}{n}\sum_{i \in [n]} f_i(x)$ and minimize $f_S$ over $K$: $\bar{x} \in \arg\min_{x \in K} f_S(x)$. Does ERM generalize, i.e., how large is $f_D(\bar{x})$?
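A bare-bones sketch of ERM under these definitions, with a generic constrained solver standing in for "any empirical minimizer"; `fs` is a list of callables $f_i$ as in the earlier sketch, and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def erm(fs, d):
    """Minimize the empirical average f_S(x) = (1/n) * sum_i f_i(x) over the unit ball K."""
    f_S = lambda x: float(np.mean([f(x) for f in fs]))
    ball = {"type": "ineq", "fun": lambda x: 1.0 - np.linalg.norm(x)}  # ||x||_2 <= 1
    res = minimize(f_S, x0=np.zeros(d), method="SLSQP", constraints=[ball])
    return res.x  # an (approximate) empirical risk minimizer x_bar
```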
5
Uniform convergence: $\forall x \in K,\ |f_S(x) - f_D(x)| \le \epsilon$. If $f_S(\bar{x}) \le \min_{x \in K} f_S(x) + \alpha$, then $f_D(\bar{x}) \le \min_{x \in K} f_D(x) + \alpha + 2\epsilon$. [Figure: $f_S$ sandwiched between $f_D - \epsilon$ and $f_D + \epsilon$, with the empirical minimizer $\bar{x}$ marked.] Generalization guarantees for ERM are usually based on uniform convergence. For many problems the number of samples necessary for learning is sufficient for uniform convergence (up to constant factors), and this is often known to be optimal (e.g., via the VC dimension).
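Spelling out the implication, with $x^* \in \arg\min_{x \in K} f_D(x)$:

```latex
\begin{align*}
f_D(\bar{x}) &\le f_S(\bar{x}) + \epsilon
    && \text{(uniform convergence at } \bar{x}\text{)} \\
  &\le f_S(x^*) + \alpha + \epsilon
    && \text{(}\bar{x}\text{ is an } \alpha\text{-approximate minimizer of } f_S\text{)} \\
  &\le f_D(x^*) + \alpha + 2\epsilon
    && \text{(uniform convergence at } x^*\text{)}.
\end{align*}
```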
6
ERM for SCO: the gap β 2 Lipschitz bounded SCO ERM upper bounds:
πΎβ πΉ 2 π 1 β π₯ π₯ 2 β€1} πΉ= π convex βπ₯,π¦βπΎ, π π₯ βπ π¦ β€ π₯βπ¦ 2 ERM upper bounds: π= π π π or πβ π π via uniform convergence Can be solved using π=π 1 π 2 samples Stochastic gradient descent [Robbins,Monro β51; Polyak β90]
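A hedged sketch of the dimension-independent algorithm the slide points to: one-pass projected stochastic gradient descent, where each step consumes a fresh sample. The step size and averaging follow the standard analysis; exact constants are not from the slide:

```python
import numpy as np

def project_ball(x):
    """Euclidean projection onto K = {x : ||x||_2 <= 1}."""
    return x / max(1.0, np.linalg.norm(x))

def sgd(grad_fs, d, eta=None):
    """One-pass projected SGD: step t uses a (sub)gradient of the FRESH sample f_t only.

    grad_fs[t](x) should return a (sub)gradient of f_t at x.  With eta ~ 1/sqrt(T),
    the averaged iterate is O(1/sqrt(n))-suboptimal for f_D, independently of d.
    """
    T = len(grad_fs)
    eta = eta if eta is not None else 1.0 / np.sqrt(T)
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(T):
        x = project_ball(x - eta * grad_fs[t](x))
        avg += x / T
    return avg
```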
7
ERM lower bounds: can the ERM upper bounds be improved?
Linear models: $n = O(1/\epsilon^2)$ via uniform convergence [Kakade, Sridharan, Tewari '08]. Strongly convex: $n = O(1/\epsilon^2)$ via stability [Shalev-Shwartz, Shamir, Sridharan, Srebro '09]. ERM might require $n = \Omega\big(\frac{\log d}{\epsilon}\big)$ samples [S⁵ '09]. ERM might require $n = \Omega\big(\frac{d}{\epsilon}\big)$ samples [F '16].
8
Construction: $f : B_2^d(1) \to [0,1]$ with an exponential number of independent maxima. $W$ is the set of maxima, $|W| = 2^{d/6}$, and $\forall w \in W,\ f(w) = \frac{1}{8}$. For $V \subseteq W$, the function $f_V$ has $V$ as its set of maxima: $\forall v \in V,\ f_V(v) = \frac{1}{8}$ and $\forall w \in W \setminus V,\ f_V(w) = 0$.
9
Distribution $D$: $f_V$ where $V$ is a random subset of $W$. If $n \le d/6$ then with probability $\ge 1/2$ there exists $w \in W$ with $f_S(w) = 0$ and $f_D(w) = 1/16$ (note $f_D = f/2$). The construction can be made efficient: $D$ supported over functions computable in $O(d)$ time [F '16].
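The counting behind "with probability $\ge 1/2$", under the assumption (mine, for concreteness) that each $w \in W$ is included in the random subset $V$ independently with probability $1/2$, with $V_1, \ldots, V_n$ the subsets behind the samples in $S$:

```latex
\[
\Pr\big[\forall w \in W\ \exists i:\ w \in V_i\big]
  = \big(1 - 2^{-n}\big)^{|W|}
  \le \exp\!\big(-2^{-n}\,|W|\big)
  = \exp\!\big(-2^{\,d/6 - n}\big)
  \le e^{-1} < \tfrac{1}{2}
  \qquad \text{when } n \le d/6 .
\]
```

So with probability at least $1/2$ some $w \in W$ avoids every sampled $V_i$; at that point $f_S(w) = 0$ while $f_D(w) = 1/16$, which is exactly how an ERM can overfit here.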
10
Algorithms & generalization
Which optimization algorithms generalize well? Does gradient descent on $f_S$ generalize? Projected gradient descent: initialize $x_1 \in K$; for $t = 1$ to $T$, set $x_{t+1} = \Pi_K\big(x_t - \eta \nabla f_S(x_t)\big)$ with $\nabla f_S(x_t) = \frac{1}{n}\sum_i \nabla f_i(x_t)$; output $\frac{1}{T}\sum_t x_t$. For a fixed $x$, $\big\|\frac{1}{n}\sum_i \nabla f_i(x) - \nabla f_D(x)\big\|_2 \approx \frac{1}{\sqrt{n}}$ with high probability, but $x_t$ depends on $S$! The points at which the gradient is estimated are chosen adaptively, so we cannot rely on the independence needed for concentration results; this type of problem is addressed by adaptive data analysis.
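For contrast with the one-pass SGD sketch above, here is a hedged sketch of full-batch projected gradient descent on $f_S$ (step count and step size are illustrative). The point is only that every iterate reuses the same $n$ samples:

```python
import numpy as np

def pgd_on_empirical(grad_fs, d, T=100, eta=0.1):
    """Projected GD on f_S: every step reuses the SAME n sampled gradients, so the
    query points x_t depend on S and the fixed-x concentration bound no longer applies."""
    def grad_f_S(x):
        # (1/n) * sum_i grad f_i(x): close to grad f_D(x) only for x chosen independently of S
        return np.mean([g(x) for g in grad_fs], axis=0)

    x = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        x = x - eta * grad_f_S(x)
        x = x / max(1.0, np.linalg.norm(x))   # projection onto the unit ball K
        avg += x / T
    return avg
```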
11
Adaptive data analysis
[Diagram: a data analyst issues analyses $A_1, A_2, \ldots, A_k$ adaptively against the dataset $S$ and receives answers $v_1, v_2, \ldots, v_k$.] How to ensure generalization? Differential privacy / information-theoretic stability [DFHPRR '14; BNSSSU '15; BF '16; RRTWX '16]. Information bounds [DFHPRR '15; RZ '16; RRST '16]. Lower bounds [HU '14; SU '15]. Data analysis consists of many steps, where the analysis performed at each step depends on the results of previous steps; ensuring generalization is hard even if each individual analysis generalizes when executed on independent data points.
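A hedged sketch of the interaction the diagram depicts, together with one remedy in the spirit of the differential-privacy line of work cited above: answer each bounded statistical query with its empirical mean perturbed by Gaussian noise. The query interface, noise scale, and names are illustrative assumptions, not the mechanism of any specific cited paper:

```python
import numpy as np

def adaptive_session(data, analysts, sigma=0.05, seed=0):
    """Each analysis A_j sees the previous answers v_1, ..., v_{j-1} and returns a
    statistical query q: x -> [0, 1]; the curator answers with a noisy empirical mean."""
    rng = np.random.default_rng(seed)
    answers = []
    for A in analysts:
        q = A(answers)                      # the next query can depend on past answers
        v = np.mean([q(x) for x in data]) + rng.normal(scale=sigma)
        answers.append(v)
    return answers
```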
12
Summary
Cannot rely on uniform convergence/ERM for high-dimensional SCO. The choice of optimization algorithm affects generalization. Need better algorithms and analysis tools for dealing with adaptivity.