Generalization bounds for uniformly stable algorithms
Vitaly Feldman (Google Brain), with Jan Vondrak
Uniform stability
Setting:
- Domain $Z$ (e.g. $X \times Y$)
- Dataset $S = (z_1, \ldots, z_n) \in Z^n$
- Learning algorithm $A \colon Z^n \to W$
- Loss function $\ell \colon W \times Z \to \mathbb{R}^+$
Uniform stability [Bousquet, Elisseeff '02]: $A$ has uniform stability $\gamma$ w.r.t. $\ell$ if for all neighboring $S, S'$ and all $z \in Z$,
$$|\ell(A(S), z) - \ell(A(S'), z)| \le \gamma.$$
Generalization error
Probability distribution $P$ over $Z$, $S \sim P^n$.
Population loss: $\mathbf{E}_P[\ell(w)] = \mathbf{E}_{z \sim P}[\ell(w, z)]$
Empirical loss: $\mathcal{E}_S[\ell(w)] = \frac{1}{n} \sum_{i=1}^n \ell(w, z_i)$
Generalization error/gap: $\Delta_S \ell(A) = \mathbf{E}_P[\ell(A(S))] - \mathcal{E}_S[\ell(A(S))]$
Stochastic convex optimization
$W = \mathbb{B}_2^d(1) \doteq \{w : \|w\|_2 \le 1\}$; for all $z \in Z$, $\ell(w, z)$ is convex and 1-Lipschitz in $w$.
Goal: minimize $\mathbf{E}_P[\ell(w)] \doteq \mathbf{E}_{z \sim P}[\ell(w, z)]$ over $W$, with $L^* = \min_{w \in W} \mathbf{E}_P[\ell(w)]$.
For all $P$, with $A$ being SGD with rate $\eta = 1/\sqrt{n}$:
$$\Pr_{S \sim P^n}\left[\mathbf{E}_P[\ell(A(S))] \ge L^* + O\!\left(\sqrt{\tfrac{\log(1/\delta)}{n}}\right)\right] \le \delta.$$
In contrast, the uniform convergence error can be $\gtrsim \sqrt{d/n}$, and an ERM might have generalization error $\gtrsim d/n$ [Shalev-Shwartz, Shamir, Srebro, Sridharan '09; Feldman '16].
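The SGD guarantee above can be made concrete on a toy instance. Below is a minimal sketch (assumed setup, not from the talk: $d = 1$, $W = [-1, 1]$, and the hypothetical loss $\ell(w, z) = |w - z|$, which is convex and 1-Lipschitz) of one-pass projected subgradient descent with rate $\eta = 1/\sqrt{n}$, returning the averaged iterate:

```python
import math
import random

def sgd(samples, eta=None):
    """One pass of projected subgradient descent on ell(w, z) = |w - z|
    over W = [-1, 1] with rate eta = 1/sqrt(n); returns the averaged iterate."""
    n = len(samples)
    eta = 1.0 / math.sqrt(n) if eta is None else eta
    w, iterates = 0.0, []
    for z in samples:
        g = 1.0 if w > z else (-1.0 if w < z else 0.0)  # subgradient of |w - z| at w
        w = max(-1.0, min(1.0, w - eta * g))            # gradient step + projection
        iterates.append(w)
    return sum(iterates) / n

random.seed(0)
data = [random.uniform(0.2, 0.4) for _ in range(2000)]  # P = Uniform[0.2, 0.4]
w_hat = sgd(data)
```

Since $\mathbf{E}_{z \sim P}|w - z|$ is minimized at the median of $P$ (here $0.3$), the averaged iterate should land near it, with excess population loss on the order of $1/\sqrt{n}$.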
Stable optimization
Strongly convex ERM [BE '02, SSSS '09]:
$$A_\lambda(S) = \operatorname*{argmin}_{w \in W}\; \mathcal{E}_S[\ell(w)] + \frac{\lambda}{2}\|w\|_2^2$$
$A_\lambda$ is $\frac{1}{\lambda n}$-uniformly stable and minimizes $\mathcal{E}_S[\ell(w)]$ within $\frac{\lambda}{2}$.
Gradient descent on smooth losses [Hardt, Recht, Singer '16]:
$A_T(S)$: $T$ steps of GD on $\mathcal{E}_S[\ell(w)]$ with $\eta = 1/\sqrt{T}$.
$A_T$ is $\frac{\sqrt{T}}{n}$-uniformly stable and minimizes $\mathcal{E}_S[\ell(w)]$ within $\frac{2}{\sqrt{T}}$.
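The stability of the strongly convex ERM can be observed numerically. The sketch below (assumptions, not from the talk: a hypothetical 1-D problem with $\ell(w, z) = |w - z|$, $W = [-1, 1]$, and the minimizer found by brute-force grid search) replaces one point of $S$ and checks that the minimizer moves by at most $O(\frac{1}{\lambda n})$; the constant in the replace-one bound varies by convention, so $\frac{2}{\lambda n}$ is used here:

```python
import random

def erm_reg(samples, lam, step=1e-3):
    """A_lam(S) = argmin over W = [-1, 1] of (1/n) sum_i |w - z_i| + (lam/2) w^2,
    computed by grid search (toy 1-D problem, accuracy = grid step)."""
    n = len(samples)
    grid = [-1.0 + k * step for k in range(int(2.0 / step) + 1)]
    obj = lambda w: sum(abs(w - z) for z in samples) / n + 0.5 * lam * w * w
    return min(grid, key=obj)

random.seed(1)
n, lam = 200, 0.5
S = [random.uniform(-1.0, 1.0) for _ in range(n)]
S_prime = S[:-1] + [random.uniform(-1.0, 1.0)]  # neighboring dataset: one point replaced

w, w_prime = erm_reg(S, lam), erm_reg(S_prime, lam)
# Since ell is 1-Lipschitz in w, loss stability follows from |w - w_prime|,
# which strong convexity bounds by 2 / (lam * n) = 0.02 here.
gap = abs(w - w_prime)
```

The argument behind the bound: adding the two strong-convexity inequalities at the two minimizers leaves only the terms of the single swapped point, each of which is 1-Lipschitz in $w$.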
Generalization bounds
For $A, \ell$ with range $[0,1]$ and uniform stability $\gamma \in [\frac{1}{n}, 1]$:
$$\mathbf{E}_{S \sim P^n}[\Delta_S \ell(A)] \le \gamma \quad \text{[Rogers, Wagner '78]}$$
$$\Pr_{S \sim P^n}\left[\Delta_S \ell(A) \ge \gamma \sqrt{n \log(1/\delta)}\right] \le \delta \quad \text{[Bousquet, Elisseeff '02]}$$
The latter is vacuous when $\gamma \ge 1/\sqrt{n}$.
NEW:
$$\Pr_{S \sim P^n}\left[\Delta_S \ell(A) \ge \sqrt{\gamma \log(1/\delta)}\right] \le \delta$$
Comparison
[Plot: generalization-error bound vs. stability $\gamma$ at $n = 100$, comparing [BE 02] with this work; the [BE 02] bound becomes vacuous around $\gamma = 1/\sqrt{n}$.]
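To make the comparison concrete, a small script (ignoring the constants hidden in both bounds) can tabulate the two tail bounds at $n = 100$:

```python
import math

def be_bound(gamma, n, delta):
    """High-probability bound of [Bousquet, Elisseeff 02]: gamma * sqrt(n log(1/delta))."""
    return gamma * math.sqrt(n * math.log(1.0 / delta))

def new_bound(gamma, delta):
    """Bound of this work: sqrt(gamma * log(1/delta))."""
    return math.sqrt(gamma * math.log(1.0 / delta))

n, delta = 100, 0.01
for gamma in (0.001, 0.01, 0.1):
    print(f"gamma={gamma}: BE02={be_bound(gamma, n, delta):.3f}, new={new_bound(gamma, delta):.3f}")
```

The crossover is at $\gamma = 1/n$: below it [BE 02] is smaller, while at $\gamma = 1/\sqrt{n} = 0.1$ the [BE 02] bound exceeds 1 (vacuous for a $[0,1]$-valued loss) and the new bound is still nontrivial.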
Second moment
Previously [Devroye, Wagner '79; BE '02]:
$$\mathbf{E}_{S \sim P^n}\left[\Delta_S \ell(A)^2\right] \le \gamma + \frac{1}{n}$$
NEW (tight):
$$\mathbf{E}_{S \sim P^n}\left[\Delta_S \ell(A)^2\right] \le \gamma^2 + \frac{1}{n}$$
By Chebyshev:
$$\Pr_{S \sim P^n}\left[\Delta_S \ell(A) \ge \frac{\gamma + 1/\sqrt{n}}{\sqrt{\delta}}\right] \le \delta$$
[Plot: generalization-error bound vs. $\gamma$, comparing [BE 02] with this work.]
Implications
Differentially-private prediction [Dwork, Feldman '18]: $Z = X \times Y$. A randomized algorithm $A(S, x)$ makes $\epsilon$-DP predictions if for all neighboring $S, S'$ and all $x \in X$,
$$D_\infty\!\left(A(S, x)\,\|\,A(S', x)\right) \le \epsilon.$$
For any loss $L \colon Y \times Y \to [0,1]$, $\mathbf{E}_A[L(A(S, x), y)]$ has uniform stability $e^\epsilon - 1 \approx \epsilon$, so the new results give stronger generalization bounds for DP prediction algorithms.
For stable convex optimization, the high-probability excess loss improves from
$$\Pr_{S \sim P^n}\left[\mathbf{E}_P[\ell(A(S))] \ge L^* + O\!\left(\frac{1}{\delta^{1/4}\sqrt{n}}\right)\right] \le \delta$$
to
$$\Pr_{S \sim P^n}\left[\mathbf{E}_P[\ell(A(S))] \ge L^* + O\!\left(\left(\frac{\log(1/\delta)}{n}\right)^{1/3}\right)\right] \le \delta.$$
Data-dependent functions
Consider $M \colon Z^n \times Z \to \mathbb{R}$, e.g. $M(S, z) \equiv \ell(A(S), z)$.
$M$ has uniform stability $\gamma$ if for all neighboring $S, S'$,
$$\|M(S, \cdot) - M(S', \cdot)\|_\infty \le \gamma,$$
i.e. $|M(S, z) - M(S', z)| \le \gamma$ for all $z \in Z$.
Generalization error/gap: $\Delta_S M = \mathbf{E}_P[M(S)] - \mathcal{E}_S[M(S)]$
Generalization in expectation
Goal: $\mathbf{E}_{S \sim P^n}\left[\mathbf{E}_P[M(S)] - \mathcal{E}_S[M(S)]\right] \le \gamma$.
First, $\mathbf{E}_{S \sim P^n}\left[\mathcal{E}_S[M(S)]\right] = \mathbf{E}_{S \sim P^n,\, i \sim [n]}\left[M(S, z_i)\right]$.
For all $i$ and $z \in Z$, stability gives $M(S, z_i) \le M(S^{i \leftarrow z}, z_i) + \gamma$, hence $M(S, z_i) \le \mathbf{E}_{z \sim P}\left[M(S^{i \leftarrow z}, z_i)\right] + \gamma$. Therefore
$$\mathbf{E}_{S \sim P^n,\, i \sim [n]}[M(S, z_i)] \le \mathbf{E}_{S \sim P^n,\, z \sim P,\, i \sim [n]}\left[M(S^{i \leftarrow z}, z_i)\right] + \gamma = \mathbf{E}_{S \sim P^n,\, z \sim P}[M(S, z)] + \gamma = \mathbf{E}_{S \sim P^n}\left[\mathbf{E}_P[M(S)]\right] + \gamma,$$
since $(S^{i \leftarrow z}, z_i)$ is distributed as $(S, z) \sim P^{n+1}$.
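The bound $\mathbf{E}_{S \sim P^n}[\Delta_S M] \le \gamma$ is easy to observe numerically. The sketch below uses a hypothetical stable function $M(S, z) = |z - \mathrm{mean}(S)|$ with $z \in [0, 1]$, which has uniform stability $\gamma = 1/n$ (replacing one point moves the mean by at most $1/n$); for $P = \mathrm{Uniform}[0, 1]$ the population loss has the closed form $\mathbf{E}_P[M(S)] = m^2 - m + \frac{1}{2}$ with $m = \mathrm{mean}(S)$:

```python
import random

random.seed(4)
n, trials = 50, 3000

def gap(S):
    """Delta_S M for M(S, z) = |z - mean(S)| and P = Uniform[0, 1]."""
    m = sum(S) / len(S)
    pop = m * m - m + 0.5                      # E_{z~P} |z - m| in closed form
    emp = sum(abs(z - m) for z in S) / len(S)  # empirical loss on S itself
    return pop - emp

avg_gap = sum(gap([random.random() for _ in range(n)]) for _ in range(trials)) / trials
```

The Monte Carlo estimate of $\mathbf{E}[\Delta_S M]$ comes out positive (fitting the mean to $S$ makes the empirical loss optimistic) but stays below $\gamma = 1/n = 0.02$, in line with the derivation above.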
Concentration via McDiarmid
For all neighboring $S, S'$: $|\Delta_S M - \Delta_{S'} M| \le 2\gamma + \frac{1}{n}$, since
1. $\left|\mathbf{E}_{z \sim P}[M(S, z)] - \mathbf{E}_{z \sim P}[M(S', z)]\right| \le \gamma$
2. $\left|\mathcal{E}_S[M(S)] - \mathcal{E}_{S'}[M(S')]\right| \le \left|\mathcal{E}_S[M(S)] - \mathcal{E}_S[M(S')]\right| + \left|\mathcal{E}_S[M(S')] - \mathcal{E}_{S'}[M(S')]\right| \le \gamma + \frac{1}{n}$
McDiarmid:
$$\Pr_{S \sim P^n}\left[\Delta_S M \ge \mu + \left(2\gamma + \frac{1}{n}\right)\sqrt{n \log(1/\delta)}\right] \le \delta, \quad \text{where } \mu = \mathbf{E}_{S \sim P^n}[\Delta_S M] \le \gamma.$$
Proof technique
Based on [Nissim, Stemmer '15; BNSSSU '16].
Let $S_1, \ldots, S_m \sim P^n$ for $m = \ln(2/\delta)$. By the following amplification fact, it suffices to bound $\mathbf{E}_{S_1, \ldots, S_m \sim P^n}\left[\max_{j \in [m]} \Delta_{S_j} M\right]$:
For any distribution $Q$ over $\mathbb{R}$,
$$\Pr_{v \sim Q}\left[v \ge 2\,\mathbf{E}_{v_1, \ldots, v_m \sim Q}\left[\max\{0, v_1, \ldots, v_m\}\right]\right] \le \frac{\ln 2}{m}.$$
We would like
$$\mathbf{E}_{S_1, \ldots, S_m,\, \ell = \operatorname{argmax}_j \Delta_{S_j} M}\left[\mathcal{E}_{S_\ell}[M(S_\ell)]\right] \overset{?}{\approx} \mathbf{E}_{S_1, \ldots, S_m,\, \ell = \operatorname{argmax}_j \Delta_{S_j} M}\left[\mathbf{E}_P[M(S_\ell)]\right],$$
but the $\operatorname{argmax}$ selection is UNSTABLE.
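The amplification fact above is easy to check by simulation. The sketch below (assuming $Q = \mathcal{N}(0, 1)$, a hypothetical choice of $Q$) estimates both sides by Monte Carlo:

```python
import math
import random

random.seed(2)
m, trials = 10, 20000

# Estimate E[max{0, v_1, ..., v_m}] for Q = N(0, 1).
est_max = sum(
    max(0.0, *(random.gauss(0.0, 1.0) for _ in range(m))) for _ in range(trials)
) / trials

# Estimate Pr[v >= 2 * E max] and compare with the lemma's ln(2) / m.
tail = sum(random.gauss(0.0, 1.0) >= 2.0 * est_max for _ in range(trials)) / trials
```

Here est_max should come out around 1.5 (the expected maximum of ten standard normals), so the threshold is about 3 and the empirical tail sits far below $\ln(2)/10 \approx 0.069$.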
Stable max
Exponential mechanism [McSherry, Talwar '07]: $\mathrm{EM}_\alpha$ samples $j \propto e^{\alpha \Delta_{S_j} M}$.
Stable: $2\alpha\left(2\gamma + \frac{1}{n}\right)$-differentially private, hence
$$\mathbf{E}_{S_1, \ldots, S_m,\, \ell = \mathrm{EM}_\alpha}\left[\mathcal{E}_{S_\ell}[M(S_\ell)]\right] \le \mathbf{E}_{S_1, \ldots, S_m,\, \ell = \mathrm{EM}_\alpha}\left[\mathbf{E}_P[M(S_\ell)]\right] + \left(e^{2\alpha(2\gamma + 1/n)} - 1\right) + \gamma.$$
Approximates the max:
$$\mathbf{E}_{S_1, \ldots, S_m,\, \ell = \mathrm{EM}_\alpha}\left[\Delta_{S_\ell} M\right] \ge \mathbf{E}_{S_1, \ldots, S_m}\left[\max_{j \in [m]} \Delta_{S_j} M\right] - \frac{\ln m}{\alpha}.$$
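A minimal sketch of the exponential mechanism used as a stable surrogate for argmax (the score values below are hypothetical stand-ins for the gaps $\Delta_{S_j} M$):

```python
import math
import random

def exp_mechanism(scores, alpha, rng=random):
    """Sample index j with probability proportional to exp(alpha * scores[j])
    ([McSherry, Talwar 07]); a differentially private surrogate for argmax."""
    mx = max(scores)                                    # shift for numerical stability
    weights = [math.exp(alpha * (s - mx)) for s in scores]
    r = rng.random() * sum(weights)
    for j, w in enumerate(weights):                     # inverse-CDF sampling
        r -= w
        if r <= 0.0:
            return j
    return len(scores) - 1

random.seed(3)
scores = [0.10, 0.50, 0.45, 0.20]  # stand-ins for the gaps of m = 4 datasets
alpha = 50.0
draws = [exp_mechanism(scores, alpha) for _ in range(2000)]
top_frac = sum(d in (1, 2) for d in draws) / len(draws)
```

With this $\alpha$ almost all mass falls on the two near-maximal scores, and the expected selected score trails $\max_j$ by at most $\frac{\ln m}{\alpha} = \frac{\ln 4}{50} \approx 0.028$.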
Game over
Combining the two properties of $\mathrm{EM}_\alpha$:
$$\mathbf{E}_{S_1, \ldots, S_m \sim P^n}\left[\max_{j \in [m]} \Delta_{S_j} M\right] \le \frac{\ln m}{\alpha} + \left(e^{2\alpha(2\gamma + 1/n)} - 1\right) + \gamma.$$
Picking $\alpha = \sqrt{\frac{\ln m}{\gamma + 1/n}}$ gives $O\!\left(\sqrt{\left(\gamma + \frac{1}{n}\right)\ln\frac{1}{\delta}}\right)$, and the fact
$$\Pr_{v \sim Q}\left[v \ge 2\,\mathbf{E}_{v_1, \ldots, v_m \sim Q}\left[\max\{0, v_1, \ldots, v_m\}\right]\right] \le \frac{\ln 2}{m}$$
converts this into the high-probability bound.
Conclusions
- Better understanding of uniform stability
- New proof technique
Open problems:
- Close the gap between the upper and lower bounds
- High-probability generalization without strong uniformity