Understanding Generalization in Adaptive Data Analysis
Vitaly (the West Coast) Feldman

2 Overview
- Adaptive data analysis: motivation, framework, basic results (with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14,15])
- New results (with Thomas Steinke)
- Open problems

3 $\mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)] = {?}$
[Diagram: Data -> Analysis -> Results]
- Probability distribution $P$ over domain $X$
- Data: $S = (x_1, \dots, x_n) \sim P^n$
- Analysis: $A$
- Results: $f = A(S)$

4 Statistical inference
- Data: $S$, $n$ i.i.d. samples from $P$
- Algorithm $A$: hypothesis test, parameter estimator, classification
- Output: $f = A(S)$
- Theory: concentration/CLT, model complexity, Rademacher complexity, stability, online-to-batch
- Generalization guarantees for $f$: $\mathrm{Loss}_P(f)$

5 Data analysis is adaptive
Steps depend on previous analyses of the same dataset:
- Data pre-processing
- Exploratory data analysis
- Feature selection
- Model stacking
- Hyper-parameter tuning
- Shared datasets
- …
[Diagram: data analyst(s) run analyses $A_1, A_2, \dots, A_k$ on the dataset $S$ and obtain results $v_1, v_2, \dots, v_k$.]
Note: mention non-adaptive reuse.

6 Thou shalt not test hypotheses suggested by data
“Quiet scandal of statistics” [Leo Breiman, 1992]

7 Reproducibility crisis?
- “Why Most Published Research Findings Are False” [Ioannidis 2005]
- “Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 2015]
Adaptive data analysis is one of the causes:
- p-hacking
- Researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011]
- Garden of forking paths [Gelman, Loken 2015]

8 Existing approaches
- Sample splitting
- Selective inference: model selection + parameter estimation, variable selection + regression
- Pre-registration (image © Center for Open Science)

9 Adaptive data analysis [DFHPRR 14]
[Diagram: data analyst(s) submit analyses $A_1, A_2, \dots, A_k$ to an algorithm that holds $S = (x_1, \dots, x_n) \sim P^n$ and returns $v_1, v_2, \dots, v_k$.]
- Goal: given $S$, compute $v_i$'s “close” to running $A_i$ on fresh samples
- Each analysis is a query
- Design an algorithm for answering adaptively chosen queries
Need both tools for analysis and ways to make algorithms that compose better.

10 Adaptive statistical queries
Statistical query oracle [Kearns 93]
[Diagram: data analyst(s) submit queries $A_1, A_2, \dots, A_k$ to an oracle holding $S = (x_1, \dots, x_n) \sim P^n$ and receive $v_1, v_2, \dots, v_k$.]
- $A_i(S) \equiv \frac{1}{n} \sum_{x \in S} \phi_i(x)$, where $\phi_i : X \to [0,1]$
- Example: $\phi_i(x) = \mathrm{Loss}(f,x)$
- Accuracy: $|v_i - \mathbf{E}_{x\sim P}[\phi_i(x)]| \le \tau$ with prob. $1-\beta$
- Can measure correlations, moments, accuracy/loss
- Can run any statistical query algorithm

11 Answering non-adaptive SQs
Given $k$ non-adaptive query functions $\phi_1, \dots, \phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x\sim P}[\phi_i(x)]$.
Use the empirical mean: $\mathbf{E}_S[\phi_i] = \frac{1}{n} \sum_{x\in S} \phi_i(x)$
Sample complexity: $n = O\left(\frac{\log(k/\beta)}{\tau^2}\right)$
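The sample bound for non-adaptive queries follows from Hoeffding's inequality and a union bound over the $k$ queries. A minimal Python sketch, assuming the explicit constant $n \ge \ln(2k/\beta)/(2\tau^2)$ (the slide only gives the $O(\cdot)$ form); the function names are hypothetical:

```python
import math
import random


def sample_size_nonadaptive(k: int, tau: float, beta: float) -> int:
    """Samples sufficient for k fixed queries with range [0, 1].

    Hoeffding: Pr[|E_S[phi] - E_P[phi]| > tau] <= 2 * exp(-2 * n * tau^2).
    A union bound over k queries gives n >= ln(2k / beta) / (2 * tau^2).
    """
    return math.ceil(math.log(2 * k / beta) / (2 * tau ** 2))


def empirical_mean(phi, sample):
    """E_S[phi] = (1/n) * sum of phi(x) over x in S."""
    return sum(phi(x) for x in sample) / len(sample)


if __name__ == "__main__":
    rng = random.Random(0)
    k, tau, beta = 100, 0.05, 0.05
    n = sample_size_nonadaptive(k, tau, beta)
    # Illustrative P: uniform on [0, 1]. Query i asks for Pr[x <= i/k] = i/k.
    sample = [rng.random() for _ in range(n)]
    worst = max(
        abs(empirical_mean(lambda x, t=i / k: float(x <= t), sample) - i / k)
        for i in range(1, k + 1)
    )
    print(f"n = {n}, worst error over {k} fixed queries = {worst:.4f}")
```

The queries here are fixed before the sample is drawn, which is exactly the assumption the union bound needs.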

12 Answering adaptively-chosen SQs
What if we use $\mathbf{E}_S[\phi_i]$?
For some constant $\beta > 0$: $n \ge \frac{k}{\tau^2}$
Examples: variable selection, boosting, step-wise regression, ...
Sample splitting: $n = O\left(\frac{k \cdot \log k}{\tau^2}\right)$
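A small simulation can make the failure of the empirical mean concrete. The sketch below is an illustration under assumed settings (random ±1 features, labels independent of the features, hypothetical function name), not an experiment from the talk: the analyst first asks one correlation query per feature, then adaptively builds a classifier from the answers and asks for its accuracy on the same data.

```python
import random


def adaptive_overfit_demo(n=100, d=300, fresh=1000, seed=0):
    rng = random.Random(seed)
    # Features and labels are independent fair coin flips in {-1, +1},
    # so no classifier can beat 50% accuracy on fresh data from P.
    X = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]

    # Round 1: d queries, each the empirical correlation of feature j with y.
    corr = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]

    # Adaptive step: the next query is built from the answers just received.
    signs = [1 if c > 0 else -1 for c in corr]

    def f(x):
        vote = sum(s * xj for s, xj in zip(signs, x))
        return 1 if vote >= 0 else -1

    # Round 2: one query for the accuracy of f, answered on the same sample S.
    empirical_acc = sum(f(X[i]) == y[i] for i in range(n)) / n

    # The same query answered on fresh samples approximates the true value 0.5.
    hits = 0
    for _ in range(fresh):
        x = [rng.choice((-1, 1)) for _ in range(d)]
        hits += f(x) == rng.choice((-1, 1))
    return empirical_acc, hits / fresh


if __name__ == "__main__":
    emp, true_est = adaptive_overfit_demo()
    print(f"accuracy on reused sample: {emp:.2f}, on fresh samples: {true_est:.2f}")
```

The reused sample reports accuracy far above chance even though the data contain no signal, which is the overfitting that adaptivity allows.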

13 Answering adaptive SQs
[DFHPRR 14] There exists an algorithm that can answer $k$ adaptively chosen SQs with accuracy $\tau$ for $n = \tilde{O}\left(\frac{\sqrt{k}}{\tau^{2.5}}\right)$
Data splitting: $O\left(\frac{k}{\tau^2}\right)$
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15] $n = \tilde{O}\left(\frac{\sqrt{k}}{\tau^2}\right)$
Generalizes to low-sensitivity analyses: $|A_i(S) - A_i(S')| \le \frac{1}{n}$ when $S, S'$ differ in a single element
Estimates $\mathbf{E}_{S\sim P^n}[A_i(S)]$ within $\tau$
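To get a feel for the gap between these rates, the sketch below evaluates data splitting ($k/\tau^2$), [DFHPRR 14] ($\sqrt{k}/\tau^{2.5}$) and [BNSSSU 15] ($\sqrt{k}/\tau^2$) with all constants and logarithmic factors dropped. That simplification is an assumption made here for illustration, so the numbers are orders of magnitude only.

```python
import math


def samples_needed(k: int, tau: float) -> dict:
    """Leading-order sample sizes; constants and log factors are omitted."""
    return {
        "data splitting": k / tau ** 2,
        "DFHPRR 14": math.sqrt(k) / tau ** 2.5,
        "BNSSSU 15": math.sqrt(k) / tau ** 2,
    }


if __name__ == "__main__":
    for k in (100, 10_000, 1_000_000):
        row = samples_needed(k, tau=0.1)
        print(k, {name: round(n) for name, n in row.items()})
```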

14 Value perturbation
Answer low-sensitivity query $A$ with $A(S) + \zeta$ (random noise $\zeta$)
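Value perturbation can be sketched in a few lines. The slide does not specify the distribution of $\zeta$; the sketch below assumes Gaussian noise with a user-chosen standard deviation `sigma`, and the class name is hypothetical.

```python
import random


class NoisyQueryAnswerer:
    """Answers each query A with A(S) + zeta, zeta ~ N(0, sigma^2).

    Assumption: Gaussian noise, drawn independently for every query.
    """

    def __init__(self, sample, sigma, seed=None):
        self._sample = list(sample)
        self._sigma = sigma
        self._rng = random.Random(seed)

    def answer(self, query):
        # `query` maps the whole dataset to a real number (e.g. a mean).
        return query(self._sample) + self._rng.gauss(0.0, self._sigma)


if __name__ == "__main__":
    oracle = NoisyQueryAnswerer([0, 1, 1, 0, 1], sigma=0.05, seed=0)
    print(oracle.answer(lambda s: sum(s) / len(s)))
```

The analyst only ever sees the noisy answers, never the dataset itself.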

15 Differential privacy [Dwork,McSherry,Nissim,Smith 06]
DP implies generalization
Differential privacy is stability:
If $M$ is $(\epsilon,\delta)$-DP and outputs a function $X \to [0,1]$, then for every $x$ and every $S, S'$ that differ in a single element,
$|\mathbf{E}_{\phi=M(S)}[\phi(x)] - \mathbf{E}_{\phi=M(S')}[\phi(x)]| \lesssim \epsilon + \delta$
Uniform replace-one stability implies generalization in expectation [Bousquet, Elisseeff 02]:
$|\mathbf{E}_{S\sim P^n,\ \phi=M(S)}[\mathbf{E}_S[\phi]] - \mathbf{E}_{S\sim P^n,\ \phi=M(S)}[\mathbf{E}_P[\phi]]| \lesssim \epsilon + \delta$
DP implies generalization with high probability [DFHPRR 14, BNSSSU 15]

16 Differential privacy [DMNS 06]
DP implies generalization
Differential privacy limits information learned about the dataset
Max-information: for an algorithm $M: X^n \to Y$ and dataset $\mathbf{S} \sim D$ over $X^n$, $I_\infty^\beta(\mathbf{S}; M(\mathbf{S})) \le k$ if for any event $Z \subseteq X^n \times Y$:
$\Pr_{\mathbf{S}\sim D}[(\mathbf{S}, M(\mathbf{S})) \in Z] \le e^k \cdot \Pr_{\mathbf{S},\mathbf{T}\sim D}[(\mathbf{T}, M(\mathbf{S})) \in Z] + \beta$
- $\epsilon$-DP bounds max-information [DFHPRR 15]
- $(\epsilon,\delta)$-DP bounds max-information for $D = P^n$ [Rogers, Roth, Smith, Thakkar 16]
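To see how a max-information bound turns into a generalization guarantee, one can instantiate the definition with a "bad" event. A sketch, assuming $M$ outputs a query $\phi : X \to [0,1]$ and $D = P^n$; the Hoeffding constant is the standard one and is not taken from the slide:

```latex
\begin{align*}
Z &= \{(S', \phi) : |\mathbf{E}_{S'}[\phi] - \mathbf{E}_P[\phi]| > \tau\} \\
\Pr_{\mathbf{S},\mathbf{T}\sim P^n}[(\mathbf{T}, M(\mathbf{S})) \in Z] &\le 2e^{-2n\tau^2}
  && \text{(Hoeffding: $\mathbf{T}$ is independent of $M(\mathbf{S})$)} \\
\Pr_{\mathbf{S}\sim P^n}[(\mathbf{S}, M(\mathbf{S})) \in Z] &\le e^{k} \cdot 2e^{-2n\tau^2} + \beta
  && \text{(definition of max-information)}
\end{align*}
```

So the query chosen by $M$ still generalizes on the reused dataset as long as the max-information $k$ is small compared to $n\tau^2$.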

17 Differential privacy [DMNS 06]
DP implies generalization
DP composes adaptively
Adaptive composition of $k$ $(\epsilon,\delta)$-DP algorithms is $\left(\epsilon\sqrt{k \log(1/\delta')},\ \delta' + k\delta\right)$-DP (up to a constant factor in the first parameter) for every $\delta' > 0$ and $\epsilon \le 1/\sqrt{k}$ [Dwork, Rothblum, Vadhan 10]
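The composition bound is easy to evaluate numerically. The sketch below uses the explicit form of the advanced composition theorem of [Dwork, Rothblum, Vadhan 10], $\epsilon' = \epsilon\sqrt{2k\ln(1/\delta')} + k\epsilon(e^{\epsilon} - 1)$, of which the slide shows only the leading term; basic composition ($k\epsilon$) is included for comparison. Function names are hypothetical.

```python
import math


def advanced_composition(eps: float, delta: float, k: int, delta_prime: float):
    """Privacy parameters after adaptively composing k (eps, delta)-DP steps."""
    eps_total = (
        eps * math.sqrt(2 * k * math.log(1 / delta_prime))
        + k * eps * (math.exp(eps) - 1)
    )
    return eps_total, delta_prime + k * delta


def basic_composition(eps: float, delta: float, k: int):
    """Naive bound: parameters add up linearly."""
    return k * eps, k * delta


if __name__ == "__main__":
    k, eps, delta = 100, 0.01, 1e-8  # eps <= 1/sqrt(k), as on the slide
    print("advanced:", advanced_composition(eps, delta, k, delta_prime=1e-6))
    print("basic:   ", basic_composition(eps, delta, k))
```

For small per-query $\epsilon$ the total privacy loss grows like $\sqrt{k}$ rather than $k$, which is where the $\sqrt{k}$ in the sample bounds comes from.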

18 Differential privacy [DMNS 06]
DP implies generalization
DP composes adaptively
If a differentially private $M$ is “accurate” when fresh samples are used to answer each query,
then $M$ is “accurate” when the same dataset is reused for adaptively chosen queries.

19 Value perturbation [DMNS 06]
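Answer low-sensitivity query $A$ with $A(S) + \zeta$
Given $n$ samples, achieves error $\approx \Delta(A) \cdot \sqrt{n} \cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity: $\Delta(A) = \max_{S,S'} |A(S) - A(S')|$ over $S, S'$ that differ in a single element
Low-sensitivity case: $\max_{S,S'} |A(S) - A(S')| \le 1/n$
$\Delta(A) \cdot \sqrt{n}$ could be much larger than the standard deviation of $A$ on $P$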

20 Beyond low-sensitivity
[F, Steinke 17] There exists an algorithm that, for any adaptively chosen sequence $A_1, \dots, A_k : X^d \to \mathbb{R}$, given $n = \tilde{O}(\sqrt{k}) \cdot d$ i.i.d. samples from $P$, outputs values $v_1, \dots, v_k$ such that w.h.p. for all $i$:
$|\mathbf{E}_{S\sim P^d}[A_i(S)] - v_i| \le 2\sigma_i$, where $\sigma_i = \sqrt{\mathbf{Var}_{S\sim P^d}[A_i(S)]}$

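21 Stable Median
[Diagram: $S$ is split into subsamples $S_1, S_2, S_3, \dots, S_m$ with $n = dm$; applying $A_i$ to each subsample gives values $y_1, y_2, y_3, \dots, y_{m-2}, y_{m-1}, y_m$, which form the set $U$; the output is $v$.]
Find an approximate median with (weak) DP relative to $U$: a value $v$ greater than the bottom 1/3 and smaller than the top 1/3 of the values in $U$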

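Why an approximate median of the subsample values lands within $2\sigma_i$ of the mean can be sketched with Chebyshev's inequality. The derivation below assumes the subsamples are fresh relative to $A_i$, writes $y_j = A_i(S_j)$, and uses $\mu_i$ (a symbol introduced here) for the expectation of $A_i$ on a fresh subsample; under adaptive reuse it is the differentially private median step that preserves this guarantee.

```latex
\begin{align*}
\Pr\big[\,|y_j - \mu_i| \ge 2\sigma_i\,\big] &\le \frac{\sigma_i^2}{(2\sigma_i)^2} = \frac{1}{4}
  && \text{(Chebyshev, for each subsample $S_j$)} \\
\#\{j : y_j < \mu_i - 2\sigma_i\} < \tfrac{m}{3}
  \ \text{and}\ \text{$v \ge$ every value in the bottom third of $U$}
  &\implies v \ge \mu_i - 2\sigma_i \\
\#\{j : y_j > \mu_i + 2\sigma_i\} < \tfrac{m}{3}
  \ \text{and}\ \text{$v \le$ every value in the top third of $U$}
  &\implies v \le \mu_i + 2\sigma_i
\end{align*}
```

Since each $y_j$ falls outside the interval with probability at most $1/4 < 1/3$, both counting conditions hold with high probability once $m$ is large enough.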
22 Median algorithms
Requires discretization: ground set $T$, $|T| = r$
- Upper bound: $2^{O(\log^* r)}$ samples
- Lower bound: $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15]
Exponential mechanism [McSherry, Talwar 07]
- Output $v \in T$ with prob. $\propto \exp\left(-\epsilon \left| \#\{y \in U : v \le y\} - \frac{m}{2} \right|\right)$
- Uses $O\left(\frac{\log r}{\epsilon}\right)$ samples
Stability and confidence amplification for the price of one log factor!
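The exponential-mechanism median combines with subsampling roughly as follows. This is a minimal sketch, not the exact algorithm of [F, Steinke 17]: the grid `grid` plays the role of the ground set $T$, the score is the one on the slide, and the function names, the value of `eps` and the test distribution are assumptions made for illustration.

```python
import math
import random


def exp_mech_median(values, grid, eps, rng):
    """Sample v from grid with prob. proportional to exp(-eps * |#{y >= v} - m/2|)."""
    m = len(values)
    scores = [abs(sum(1 for y in values if v <= y) - m / 2) for v in grid]
    best = min(scores)
    # Shifting by the best score leaves the distribution unchanged
    # and avoids floating-point underflow.
    weights = [math.exp(-eps * (s - best)) for s in scores]
    return rng.choices(grid, weights=weights, k=1)[0]


def stable_median_estimate(sample, estimator, m, grid, eps, rng):
    """Split the sample into m disjoint chunks, evaluate the estimator on
    each chunk, and return a private approximate median of the m values."""
    size = len(sample) // m
    ys = [estimator(sample[j * size:(j + 1) * size]) for j in range(m)]
    return exp_mech_median(ys, grid, eps, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.3, 1.0) for _ in range(3000)]
    grid = [i / 100 for i in range(-100, 101)]  # ground set T with r = 201
    v = stable_median_estimate(sample, lambda s: sum(s) / len(s), 100, grid, 1.0, rng)
    print(f"estimate of the mean (true value 0.3): {v}")
```

Grid points far from the median of the chunk values get exponentially small weight, so the output is an approximate median with high probability.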

23 Limits
Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15]
*in sufficiently high dimension or under crypto assumptions
- Analysts without side information about $P$? Queries depend only on previous answers
- Fixed “natural” analyst/learning algorithm, e.g. gradient descent for stochastic convex optimization
Does there exist an analyst whose statistical queries require more than $O(\log k)$ samples to answer? (with 0.1 accuracy/confidence)

24 ML practice
[Diagram: Data is split into Training, Validation, and Testing sets; training and validation (XGBoost, SVRG, TensorFlow) iterate over $\theta$ and $f$; the final $f$ is evaluated on the Testing set.]
Test error of $f$ $\approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$

25 Reusable holdout [DFHPRR 15]
[Diagram: Data is split into Training $T$ and Holdout $H$. The analyst (“AI guru”) submits functions $f_1, f_2, \dots, f_k$ to the reusable holdout algorithm and receives $\mathrm{Loss}(f_1), \mathrm{Loss}(f_2), \dots, \mathrm{Loss}(f_k)$.]

26 Reusable holdout [DFHPRR 15, FS 17]
There exists an algorithm that can accurately estimate the loss of $k$ adaptively chosen functions as long as at most $\ell$ overfit to the training set, for $n \sim \ell \cdot \log k$
Overfitting: $\mathbf{E}_{x\sim T}[\mathrm{Loss}(f,x)] \not\approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$
Verifying mostly correct answers with DP is cheap: sparse vector technique [Dwork, Naor, Reingold, Rothblum, Vadhan 09]
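A reusable holdout can be sketched in the spirit of the Thresholdout algorithm of [DFHPRR 15]: the holdout is consulted only to check the training estimate, and a noisy holdout value is released only when the two disagree. The sketch below is simplified; the noise scales, default parameters and budget handling are illustrative assumptions, not the paper's exact settings, and the class name is hypothetical.

```python
import random


def laplace(rng, scale):
    """Laplace(0, scale) noise as a difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


class ReusableHoldout:
    def __init__(self, train, holdout, threshold=0.05, noise_scale=0.01,
                 budget=10, seed=None):
        self.train = list(train)
        self.holdout = list(holdout)
        self.threshold = threshold
        self.noise_scale = noise_scale
        self.budget = budget  # number of overfitting functions we can afford
        self.rng = random.Random(seed)

    def loss(self, loss_fn):
        """Estimate the expected loss; loss_fn maps one example to [0, 1]."""
        if self.budget <= 0:
            raise RuntimeError("overfitting budget exhausted")
        train_mean = sum(map(loss_fn, self.train)) / len(self.train)
        holdout_mean = sum(map(loss_fn, self.holdout)) / len(self.holdout)
        noisy_threshold = self.threshold + laplace(self.rng, 2 * self.noise_scale)
        if abs(train_mean - holdout_mean) <= noisy_threshold:
            # Training and holdout agree: release the training estimate,
            # which reveals almost nothing new about the holdout.
            return train_mean
        # The function overfits the training set: spend budget and
        # release a noisy holdout estimate instead.
        self.budget -= 1
        return holdout_mean + laplace(self.rng, self.noise_scale)


if __name__ == "__main__":
    rh = ReusableHoldout(range(100), range(100, 200), seed=0)
    print(rh.loss(lambda x: 0.5))                         # does not overfit
    print(rh.loss(lambda x: 0.0 if x < 100 else 1.0))     # memorized training set
    print("remaining budget:", rh.budget)
```

Only the overfitting functions consume budget, which matches the idea that the cost scales with $\ell$ rather than with $k$.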

27 Conclusions
- Datasets are reused adaptively
- New conceptual framework
- Deep connections to DP
- Privacy and generalization are aligned
- Data “freshness” is a limited resource
- Real-valued analyses (without any assumptions)
- Going beyond adversarial adaptivity
- Connections to stability and selective inference
- Using these techniques in practice

