Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman
Overview
- Adaptive data analysis: motivation, definitions, basic techniques (with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15])
- New results [F, Steinke 17]
- Open problems
Learning problem
- Distribution $P$ over domain $X$; data $S = x_1,\dots,x_n \sim P^n$
- Analysis $A$ (e.g., XGBoost, SVRG, Adagrad, SVM) outputs a model $f = A(S)$
- Question: what is the true loss $\mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$?
Statistical inference
- Data: $S$, $n$ i.i.d. samples from $P$; algorithm $A$ outputs $f = A(S)$
- Theory gives generalization guarantees for $f$: model complexity, Rademacher complexity, stability, online-to-batch conversions, ...
Data analysis is adaptive
- Steps depend on previous analyses of the same dataset: data analyst(s) run $A_1$ on $S$ to get $v_1$, then $A_2$ to get $v_2$, ..., then $A_k$ to get $v_k$
- Examples: exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, ...
Thou shalt not test hypotheses suggested by data
"Quiet scandal of statistics" [Leo Breiman, 1992]
ML practice
- Data is split into a training set and a test (holdout) set
- A model $f$ (Lasso, k-NN, SVM, C4.5, kernels, ...) is fit on the training set
- Test error of $f \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$, since the test set is independent of $f$
ML practice now
- Data is split into training, validation, and test sets
- Parameters $\theta$ are tuned on the validation set (XGBoost, SVRG, TensorFlow, ...), producing $f_\theta$
- Test error of $f_\theta \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f_\theta,x)]$
Adaptive data analysis [DFHPRR 14]
- An algorithm holds $S = x_1,\dots,x_n \sim P^n$; data analyst(s) adaptively submit analyses $A_1, A_2, \dots, A_k$ and receive answers $v_1, v_2, \dots, v_k$
- Goal: given $S$, compute $v_i$'s "close" to the result of running $A_i$ on fresh samples
- Each analysis is a query: design an algorithm for answering adaptively-chosen queries
Adaptive statistical queries
- Statistical query oracle [Kearns 93]: each query is a function $\phi_i : X \to [0,1]$; on $S = x_1,\dots,x_n \sim P^n$ the empirical answer is $A_i(S) \equiv \frac{1}{n}\sum_{x\in S}\phi_i(x)$
- Example: $\phi_i(x) = \mathrm{Loss}(f,x)$
- Requirement: $|v_i - \mathbf{E}_{x\sim P}[\phi_i(x)]| \le \tau$ with probability $1-\beta$
- SQs can measure correlations, moments, accuracy/loss; can run any statistical query algorithm
Answering non-adaptive SQs
- Given $k$ non-adaptive query functions $\phi_1,\dots,\phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x\sim P}[\phi_i(x)]$
- Use the empirical mean $\mathbf{E}_S[\phi_i] = \frac{1}{n}\sum_{x\in S}\phi_i(x)$
- $n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$ samples suffice (Chernoff bound plus a union bound over the $k$ fixed queries)
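To make the non-adaptive baseline concrete, here is a minimal sketch (not from the talk; names such as `answer_sqs` and the threshold queries are illustrative): answer $k$ fixed statistical queries with empirical means on a single sample and check that the worst error is small.

```python
import numpy as np

def answer_sqs(sample, queries):
    """Answer fixed (non-adaptive) statistical queries with empirical means.

    sample  : array of shape (n, d), n i.i.d. draws from P
    queries : list of functions phi mapping a data point to [0, 1]
    """
    return [np.mean([phi(x) for x in sample]) for phi in queries]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k = 10_000, 20, 100
    sample = rng.normal(size=(n, d))                       # P = N(0, I_d)
    # k fixed threshold queries phi_j(x) = 1[x_j > 0], cycling over coordinates
    queries = [lambda x, j=j % d: float(x[j] > 0.0) for j in range(k)]
    answers = answer_sqs(sample, queries)
    # The true expectation of every query is 0.5; the errors concentrate
    # at roughly sqrt(log(k)/n), matching n = O(log(k/beta)/tau^2).
    print("max error:", max(abs(a - 0.5) for a in answers))
```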
Answering adaptively-chosen SQs
- What if we still answer with the empirical mean $\mathbf{E}_S[\phi_i]$? For some constant $\beta > 0$, the answers can be $\tau$-inaccurate unless $n \ge \frac{k}{\tau^2}$
- Such adaptivity arises in variable selection, boosting, bagging, step-wise regression, ...
- Data splitting (a fresh chunk per query): $n = O\!\left(\frac{k\log k}{\tau^2}\right)$
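A standard way to see why the empirical mean fails under adaptivity (an illustrative sketch, not from the slides): ask for the empirical correlation of each of $k$ random features with a random label, then adaptively combine the features using the signs of the answers. The combined query looks strongly correlated on $S$ even though its true expectation is 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1_000, 2_000
X = rng.choice([-1.0, 1.0], size=(n, k))   # k random +/-1 features, independent of...
y = rng.choice([-1.0, 1.0], size=n)        # ...a random +/-1 label: all true correlations are 0

# Queries 1..k: correlation of each feature with the label, answered by empirical means.
empirical_corr = (X * y[:, None]).mean(axis=0)

# Adaptive query k+1: combine features using the signs of the previous answers.
phi = np.sign(empirical_corr)              # chosen *after* seeing the k answers
combined = (X @ phi) / np.sqrt(k)
print("empirical value of combined query:", float(np.mean(combined * y)))
print("true expectation on fresh data:   ", 0.0)
# The empirical value is roughly sqrt(2k/(pi*n)) ~ 1.1 here, while the truth is 0:
# the empirical mean badly overestimates once the query depends on the data.
```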
Answering adaptive SQs [DFHPRR 14]
- There exists an algorithm that answers $k$ adaptively chosen SQs with accuracy $\tau$ using $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right)$ samples (vs. data splitting: $O\!\left(\frac{k}{\tau^2}\right)$)
- [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2}}\right)$
- Generalizes to low-sensitivity analyses: $|A_i(S) - A_i(S')| \le \frac{1}{n}$ whenever $S, S'$ differ in a single element; estimates $\mathbf{E}_{S\sim P^n}[A_i(S)]$ within $\tau$
Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm $M$ is $(\epsilon,\delta)$-differentially private if for any two datasets $S, S'$ that differ in one element:
$\forall Z \subseteq \mathrm{range}(M),\quad \Pr_M[M(S) \in Z] \le e^{\epsilon} \cdot \Pr_M[M(S') \in Z] + \delta$
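For intuition, a minimal sketch (mine, not from the talk) of the standard Laplace mechanism applied to a statistical query: the empirical mean of a $[0,1]$-valued query has sensitivity $1/n$, so adding Laplace noise of scale $1/(n\epsilon)$ makes the answer $\epsilon$-differentially private.

```python
import numpy as np

def laplace_sq(sample, phi, eps, rng):
    """eps-DP answer to a statistical query phi : X -> [0, 1].

    Replacing one of the n points changes the empirical mean by at most 1/n,
    so Laplace noise with scale 1/(n*eps) suffices for eps-DP.
    """
    n = len(sample)
    empirical = np.mean([phi(x) for x in sample])
    return empirical + rng.laplace(scale=1.0 / (n * eps))

rng = np.random.default_rng(2)
sample = rng.normal(size=(5_000, 10))
answer = laplace_sq(sample, lambda x: float(x[0] > 0.0), eps=0.1, rng=rng)
print("noisy answer:", answer, "(truth: 0.5)")
```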
Differential privacy is stability
- DP implies strongly uniform replace-one stability and generalization in expectation
- DP implies generalization with high probability [DFHPRR 14, BNSSSU 15]
- DP composes adaptively: the composition of $k$ $\epsilon$-DP algorithms is, for every $\delta > 0$, $\left(\epsilon\sqrt{k\log(1/\delta)},\, \delta\right)$-DP [Dwork, Rothblum, Vadhan 10]
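A small worked example (mine) of the composition bound: given a per-query $\epsilon$ and a target $\delta$, compute the privacy of $k$ composed $\epsilon$-DP steps via the standard form of the advanced composition theorem of [Dwork, Rothblum, Vadhan 10].

```python
import math

def advanced_composition(eps, k, delta):
    """(eps', delta)-DP guarantee for the composition of k eps-DP algorithms.

    Standard form of the advanced composition theorem:
      eps' = eps * sqrt(2*k*ln(1/delta)) + k * eps * (e^eps - 1).
    For small eps this is roughly eps * sqrt(k log(1/delta)), as on the slide.
    """
    return eps * math.sqrt(2 * k * math.log(1 / delta)) + k * eps * (math.exp(eps) - 1)

# e.g. 10,000 queries, each answered by a 0.005-DP mechanism:
print(advanced_composition(eps=0.005, k=10_000, delta=1e-6))  # ~2.9, vs. 50 from basic composition
```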
Value perturbation [DMNS 06]
- Answer a low-sensitivity query $A$ (with $\max_{S,S'}|A(S) - A(S')| \le 1/n$) by releasing $A(S) + \zeta$, where $\zeta \sim N(0, \tau^2)$ is Gaussian noise
- Given $n$ samples this achieves error $\approx \Delta(A)\cdot\sqrt{n}\cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity $\max_{S,S'}|A(S) - A(S')|$
- But $\Delta(A)\cdot\sqrt{n}$ could be much larger than the standard deviation of $A$
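Value perturbation in code (a rough sketch of mine; the calibration is schematic and omits constants): answer each low-sensitivity query with its empirical value plus Gaussian noise whose scale is tied to the worst-case sensitivity and the number of adaptive queries.

```python
import numpy as np

def perturbed_answer(value_on_S, sensitivity, n, k, rng):
    """Answer a low-sensitivity query A with A(S) + Gaussian noise.

    The noise standard deviation is set to the target accuracy
    tau ~ Delta(A) * sqrt(n) * k^{1/4}, balancing the noise added per query
    against the privacy budget consumed over k adaptive queries.
    (Illustrative calibration only; constants and log factors are omitted.)
    """
    tau = sensitivity * np.sqrt(n) * k ** 0.25
    return value_on_S + rng.normal(scale=tau)

rng = np.random.default_rng(3)
n, k = 10_000, 100
sample = rng.normal(size=n)
# A(S) = empirical mean of a [0,1]-valued query; worst-case sensitivity is 1/n
value = float(np.mean(sample > 0.0))
print(perturbed_answer(value, sensitivity=1.0 / n, n=n, k=k, rng=rng))
```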
Beyond low-sensitivity [F, Steinke 17]
- There exists an algorithm that, for any adaptively-chosen sequence $A_1,\dots,A_k : X^t \to \mathbb{R}$, given $n = \tilde{O}(\sqrt{k})\cdot t$ i.i.d. samples from $P$, outputs values $v_1,\dots,v_k$ such that w.h.p. for all $i$: $|\mathbf{E}_{S\sim P^t}[A_i(S)] - v_i| \le 2\sigma_i$, where $\sigma_i = \sqrt{\mathbf{Var}_{S\sim P^t}[A_i(S)]}$
- For statistical queries $\phi_i : X \to [-B,B]$: given $n$ samples, the error scales as $\sqrt{\mathbf{Var}_{x\sim P}[\phi_i(x)]}\cdot\frac{k^{1/4}}{\sqrt{n}}$, vs. $B\cdot\frac{k^{1/4}}{\sqrt{n}}$ for value perturbation
Stable median
- Split $S$ into $m$ disjoint subsamples $S_1, S_2, \dots, S_m$ of size $t$ each ($n = tm$)
- Apply $A_i$ to each subsample to obtain values $U = (y_1, y_2, \dots, y_m)$
- Find an approximate median of $U$ with DP relative to $U$: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of $U$ (sketched below)
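The structure of this step as a rough sketch (mine): split $S$ into $m$ subsamples, run the query on each, and release a DP approximate median of the $m$ values. The `dp_median` argument is a hypothetical placeholder; one concrete instantiation via the exponential mechanism is sketched after the next slide.

```python
import numpy as np

def answer_via_stable_median(S, A, m, dp_median, rng):
    """Answer one real-valued analysis A : X^t -> R via subsample-and-median.

    S         : array of n = t*m i.i.d. samples from P
    A         : the analyst's query, applied to each subsample of size t
    dp_median : a differentially private approximate-median routine
                (hypothetical placeholder; e.g. the exponential-mechanism
                 median sketched after the next slide)
    """
    chunks = np.array_split(S, m)                  # S_1, ..., S_m, each of size ~t
    U = np.array([A(chunk) for chunk in chunks])   # y_1, ..., y_m
    return dp_median(U, rng)                       # w.h.p. between the 1/3 and 2/3 quantiles of U
```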
Median algorithms
- Require discretization: values come from a ground set $T$, $|T| = r$
- Best known: upper bound $2^{O(\log^* r)}$ samples, lower bound $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15]
- Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with probability $\propto e^{-\epsilon\,\left|\#\{y\in U :\, v\le y\} - \frac{m}{2}\right|}$; uses $O\!\left(\frac{\log r}{\epsilon}\right)$ samples
- Stability and confidence amplification for the price of one log factor!
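A minimal sketch (mine) of the exponential-mechanism median over a discretized ground set $T$: each candidate is scored by how far it sits, in ranks, from the middle of $U$, and candidates are sampled with probability exponential in the score.

```python
import numpy as np

def exp_mech_median(U, T, eps, rng):
    """DP approximate median of the values in U over a finite ground set T.

    Each candidate v in T gets score -|#{y in U : v <= y} - m/2|; changing one
    element of U changes every score by at most 1. Exponent eps*score follows
    the slide's formula (the textbook exponential mechanism divides the
    exponent by 2 to get exactly eps-DP).
    """
    U = np.asarray(U, dtype=float)
    m = len(U)
    scores = np.array([-abs(np.sum(v <= U) - m / 2) for v in T])
    weights = np.exp(eps * (scores - scores.max()))   # shift scores for numerical stability
    return rng.choice(T, p=weights / weights.sum())

rng = np.random.default_rng(4)
U = rng.normal(loc=0.3, scale=0.1, size=200)          # m values y_1, ..., y_m
T = np.linspace(-1.0, 1.0, 2001)                      # discretized ground set, |T| = r
print(exp_mech_median(U, T, eps=1.0, rng=rng))        # close to 0.3 w.h.p.
```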
Analysis
- Differential privacy approximately preserves quantiles: if $v$ is within the $\left[\frac{1}{3},\frac{2}{3}\right]$ empirical quantiles of $U$, then $v$ is within the $\left[\frac{1}{4},\frac{3}{4}\right]$ true quantiles, and hence within mean $\pm 2\sigma$
- If $\phi$ is well-concentrated on $D$ then high-probability bounds are easy to prove
- [F, Steinke 17]: let $M$ be a DP algorithm that on input $U \in Y^m$ outputs a function $\phi : Y \to \mathbb{R}$ and a value $v \in \mathbb{R}$. Then w.h.p. over $U \sim D^m$ and $(\phi, v) \leftarrow M(U)$: $\Pr_{y\sim D}[v \le \phi(y)] \approx \Pr_{y\sim U}[v \le \phi(y)]$
Limits
- Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15] (*in sufficiently high dimension or under crypto assumptions)
- Verification of responses to queries: $n = O(\sqrt{c}\,\log k)$, where $c$ is the number of queries that failed verification
  - Data splitting if overfitting [DFHPRR 14]
  - Reusable holdout [DFHPRR 15] (simplified sketch below)
  - Maintaining a public leaderboard in a competition [Blum, Hardt 15]
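A rough sketch of the reusable-holdout idea [DFHPRR 15] (simplified, my code; parameter choices are illustrative, not the paper's): answer each query from the training set, and consult the holdout only when the two estimates disagree by more than a noisy threshold, so the holdout "pays" only for queries that overfit.

```python
import numpy as np

def thresholdout(train, holdout, phi, threshold, sigma, rng):
    """One query of a simplified Thresholdout-style reusable holdout.

    Returns the training-set estimate unless it disagrees with the holdout
    estimate by more than a noisy threshold, in which case a noisy holdout
    estimate is returned. Only the latter case consumes the overfitting budget.
    """
    est_train = np.mean([phi(x) for x in train])
    est_hold = np.mean([phi(x) for x in holdout])
    if abs(est_train - est_hold) > threshold + rng.normal(scale=sigma):
        return est_hold + rng.normal(scale=sigma), True    # overfitting budget consumed
    return est_train, False
```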
Open problems
- Analysts without side information about $P$: queries depend only on previous answers
- Fixed "natural" analyst / learning algorithm: e.g., gradient descent for stochastic convex optimization
- Does there exist an SQ analyst whose queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?
Stochastic convex optimization
- Convex body $K = \mathbb{B}_2^d(1) \doteq \{x : \|x\|_2 \le 1\}$
- Class $F$ of convex 1-Lipschitz functions: $F = \{f \text{ convex} : \forall x \in K,\ \|\nabla f(x)\|_2 \le 1\}$
- Given $f_1,\dots,f_n$ sampled i.i.d. from an unknown distribution $P$ over $F$, minimize the true (expected) objective $f_P(x) \doteq \mathbf{E}_{f\sim P}[f(x)]$ over $K$: find $\bar{x}$ such that $f_P(\bar{x}) \le \min_{x\in K} f_P(x) + \epsilon$
Gradient descent
- ERM via projected gradient descent on $f_S(x) \doteq \frac{1}{n}\sum_i f_i(x)$ (see the sketch below):
  - Initialize $x_1 \in K$
  - For $t = 1$ to $T$: $x_{t+1} = \mathrm{Project}_K\!\left(x_t - \eta\cdot\nabla f_S(x_t)\right)$, where $\nabla f_S(x_t) = \frac{1}{n}\sum_i \nabla f_i(x_t)$
  - Output $\frac{1}{T}\sum_t x_t$
- With fresh samples at each step: $\|\nabla f_S(x_t) - \nabla f_P(x_t)\|_2 \le 1/\sqrt{n}$
- Sample complexity is unknown. Uniform convergence: $O\!\left(\frac{d}{\epsilon^2}\right)$ samples (tight [F. 16]); SGD solves the problem using $O\!\left(\frac{1}{\epsilon^2}\right)$ samples [Robbins, Monro 51; Polyak 90]
- Overall: $d/\epsilon^2$ statistical queries with accuracy $\epsilon$ in $1/\epsilon^2$ adaptive rounds. Sample splitting: $O\!\left(\frac{\log d}{\epsilon^4}\right)$ samples; DP-based answering: $O\!\left(\frac{\sqrt{d}}{\epsilon^3}\right)$ samples
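The ERM procedure on this slide in code (my sketch, using absolute losses $f_i(x) = |\langle w_i, x\rangle - b_i|$ with $\|w_i\|_2 \le 1$ as a simple member of $F$; the data-generation details are illustrative): projected gradient descent over the unit ball, where each iteration's empirical gradient plays the role of $d$ statistical queries and the iterations are the adaptive rounds.

```python
import numpy as np

def project_ball(x):
    """Euclidean projection onto K = {x : ||x||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def erm_projected_gd(W, b, T=200, eta=0.05):
    """ERM via projected gradient descent on f_S(x) = (1/n) sum_i |<w_i, x> - b_i|.

    Each f_i is convex and 1-Lipschitz when ||w_i||_2 <= 1, so it lies in the
    class F from the previous slide. Every iteration computes the empirical
    gradient (1/n) sum_i grad f_i(x_t) -- d statistical queries -- and the T
    iterations are adaptive rounds, since x_{t+1} depends on earlier answers.
    """
    n, d = W.shape
    x = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        residual_signs = np.sign(W @ x - b)                # subgradient of |.| per sample
        grad = (residual_signs[:, None] * W).mean(axis=0)
        x = project_ball(x - eta * grad)
        avg += x / T
    return avg

rng = np.random.default_rng(5)
n, d = 2_000, 50
W = rng.normal(size=(n, d))
W /= np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1.0)   # ensure ||w_i||_2 <= 1
b = W @ (np.ones(d) / np.sqrt(d)) + 0.01 * rng.normal(size=n)    # planted near-optimum
x_hat = erm_projected_gd(W, b)
print("empirical objective:", float(np.mean(np.abs(W @ x_hat - b))))
```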
Conclusions
- Real-valued analyses (without any assumptions)
- Going beyond tools from DP: other notions of stability for outcomes, max/mutual information
- Generalization beyond uniform convergence
- Using these techniques in practice