1 Understanding Generalization in Adaptive Data Analysis
Vitaly (the West Coast) Feldman

2 Overview
- Adaptive data analysis
  - Motivation
  - Framework
  - Basic results, with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15]
- New results (with Thomas Steinke)
- Open problems

3 $\mathbf{E}_{x \sim P}[\mathrm{Loss}(f,x)] = {?}$
Probability distribution $P$ over domain $X$; data $S = x_1, \ldots, x_n \sim P^n$; the analysis $A$ produces results $f = A(S)$.

4 Statistical inference
Data: $S$, $n$ i.i.d. samples from $P$. Algorithm $A$ (hypothesis test, parameter estimator, classification) outputs $f = A(S)$. Theory (concentration/CLT, model complexity, Rademacher complexity, stability, online-to-batch) provides generalization guarantees for $f$: $\mathrm{Loss}_P(f)$.

5 Data analysis is adaptive
Steps depend on previous analyses of the same dataset: data pre-processing, exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, … The data analyst(s) run $A_1$ on $S$ to get $v_1$, choose $A_2$ based on $v_1$ to get $v_2$, and so on up to $A_k$ and $v_k$.

6 Thou shalt not test hypotheses suggested by data
“Quiet scandal of statistics” [Leo Breiman, 1992]

7 Reproducibility crisis?
- Why Most Published Research Findings Are False [Ioannidis 2005]
- “Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 2015]
Adaptive data analysis is one of the causes:
- $p$-hacking
- Researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011]
- Garden of forking paths [Gelman, Loken 2015]

8 Existing approaches
- Sample splitting: model selection + parameter estimation; variable selection + regression
- Selective inference
- Pre-registration

9 Adaptive data analysis [DFHPRR 14]
An algorithm receives $S = x_1, \ldots, x_n \sim P^n$ and answers the data analyst(s): $A_1 \mapsto v_1$, $A_2 \mapsto v_2$, …, $A_k \mapsto v_k$, where each $A_i$ may depend on the previous answers. Goal: given $S$, compute $v_i$'s “close” to running $A_i$ on fresh samples. Each analysis is a query, so the task is to design an algorithm for answering adaptively-chosen queries. We need both tools for analysis and ways to make algorithms that compose better.

10 Adaptive statistical queries
Statistical query oracle [Kearns 93]: each query is a function $\phi_i : X \to [0,1]$ (for example, $\phi_i = \mathrm{Loss}(f, x)$), with empirical value $A_i(S) \equiv \frac{1}{n}\sum_{x \in S} \phi_i(x)$; the answer must satisfy $|v_i - \mathbf{E}_{x \sim P}[\phi_i(x)]| \le \tau$ with prob. $1 - \beta$. Statistical queries can measure correlations, moments, and accuracy/loss, so this suffices to run any statistical query algorithm.

11 Answering non-adaptive SQs
Given $k$ non-adaptive query functions $\phi_1, \ldots, \phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x \sim P}[\phi_i(x)]$. Use the empirical mean $\mathbf{E}_S[\phi_i] = \frac{1}{n}\sum_{x \in S} \phi_i(x)$; then $n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$ samples suffice.
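A minimal sketch of this baseline, assuming numpy and queries given as Python callables mapping a sample into $[0,1]$; the constant in `samples_needed` comes from Hoeffding's inequality plus a union bound over the $k$ queries:

```python
import numpy as np

def answer_nonadaptive_sqs(S, queries):
    """Answer each query phi_i : X -> [0,1] with its empirical mean on S."""
    return [np.mean([phi(x) for x in S]) for phi in queries]

def samples_needed(k, tau, beta):
    """Hoeffding + union bound: n >= ln(2k/beta) / (2 tau^2) makes
    every answer tau-accurate with probability at least 1 - beta."""
    return int(np.ceil(np.log(2 * k / beta) / (2 * tau ** 2)))
```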

12 Answering adaptively-chosen SQs
What if we use the empirical mean $\mathbf{E}_S[\phi_i]$? For some constant $\beta > 0$, this can fail unless $n \ge \frac{k}{\tau^2}$. Adaptivity of this kind arises in variable selection, boosting, step-wise regression, … Sample splitting (fresh data for each query) needs $n = O\!\left(\frac{k \cdot \log k}{\tau^2}\right)$.
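A toy numpy sketch of the attack behind this bound (names and constants are illustrative): on pure noise, query the empirical correlation of every feature with the label, then adaptively combine the features that looked correlated; the empirical accuracy of the result is far above its true accuracy of $1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 1000
X = rng.choice([-1, 1], size=(n, d))   # pure-noise "features"
y = rng.choice([-1, 1], size=n)        # labels independent of X

# Queries 1..d: empirical correlation of each feature with the label.
corr = X.T @ y / n

# Adaptive query d+1: accuracy of the sign-weighted vote over features
# whose empirical correlation cleared ~1 standard deviation (1/sqrt(n)).
w = np.where(np.abs(corr) > 1 / np.sqrt(n), np.sign(corr), 0.0)
f = np.sign(X @ w)
print("empirical accuracy:", np.mean(f == y))  # well above 1/2
# True accuracy is exactly 1/2 (labels are independent of X),
# so the empirical answer to the adaptive query has overfit.
```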

13 Answering adaptive SQs
[DFHPRR 14] There exists an algorithm that can answer $k$ adaptively chosen SQs with accuracy $\tau$ for $n = \tilde O\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right)$ (vs. data splitting: $O\!\left(\frac{k}{\tau^2}\right)$). [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15] improve this to $n = \tilde O\!\left(\frac{\sqrt{k}}{\tau^2}\right)$ and generalize it to low-sensitivity analyses: $|A_i(S) - A_i(S')| \le \frac{1}{n}$ whenever $S, S'$ differ in a single element; the algorithm estimates $\mathbf{E}_{S \sim P^n}[A_i(S)]$ within $\tau$.

14 Value perturbation
Answer a low-sensitivity query $A$ with $A(S) + \zeta$, where $\zeta$ is Laplace or Gaussian noise.
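A minimal sketch of this mechanism, assuming numpy; the noise scales are the standard Laplace- and Gaussian-mechanism calibrations, and the example values are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def perturbed_answer(empirical_value, sensitivity, epsilon, delta=None):
    """Value perturbation: return A(S) + zeta with zeta calibrated to
    the query's sensitivity. Laplace noise gives (epsilon, 0)-DP per
    query; with delta set, Gaussian noise gives (epsilon, delta)-DP."""
    if delta is None:
        zeta = rng.laplace(scale=sensitivity / epsilon)
    else:
        sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        zeta = rng.normal(scale=sigma)
    return empirical_value + zeta

# e.g. a [0,1]-valued SQ on n = 10_000 samples has sensitivity 1/n:
print(perturbed_answer(0.37, sensitivity=1e-4, epsilon=0.1))
```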

15 Differential privacy [Dwork, McSherry, Nissim, Smith 06]
DP implies generalization. Differential privacy is stability: if $M$ is $(\epsilon, \delta)$-DP and outputs a function $X \to [0,1]$, then for every $S, S', x$:
$$\left|\mathbf{E}_{\phi = M(S)}[\phi(x)] - \mathbf{E}_{\phi = M(S')}[\phi(x)]\right| \lesssim \epsilon + \delta$$
Uniform replace-one stability implies generalization in expectation [Bousquet, Elisseeff 02]:
$$\left|\mathbf{E}_{S \sim P^n,\, \phi = M(S)}\big[\mathbf{E}_S[\phi]\big] - \mathbf{E}_{S \sim P^n,\, \phi = M(S)}\big[\mathbf{E}_P[\phi]\big]\right| \lesssim \epsilon + \delta$$
Moreover, DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].

16 Differential privacy [DMNS 06]
DP implies generalization: differential privacy limits the information learned about the dataset. Max-information: for an algorithm $M : X^n \to Y$ and dataset $\mathbf{S} \sim D$ over $X^n$, $I_\infty^\beta(\mathbf{S}; M(\mathbf{S})) \le k$ if for any event $Z \subseteq X^n \times Y$:
$$\Pr_{\mathbf{S} \sim D}\big[(\mathbf{S}, M(\mathbf{S})) \in Z\big] \le e^k \cdot \Pr_{\mathbf{S}, \mathbf{T} \sim D}\big[(\mathbf{T}, M(\mathbf{S})) \in Z\big] + \beta$$
$\epsilon$-DP bounds max-information [DFHPRR 15]; $(\epsilon, \delta)$-DP bounds max-information for $D = P^n$ [Rogers, Roth, Smith, Thakkar 16].

17 Differential privacy [DMNS 06]
DP implies generalization, and DP composes adaptively: for every $\delta' > 0$ and $\epsilon \le 1/\sqrt{k}$, the adaptive composition of $k$ $(\epsilon, \delta)$-DP algorithms is $\left(\epsilon\sqrt{k \log(1/\delta')},\ \delta' + k\delta\right)$-DP [Dwork, Rothblum, Vadhan 10].
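A small calculator for this bound (a sketch; the parameter values in the example are illustrative, and the formula is the simplified form quoted on the slide, valid for $\epsilon \le 1/\sqrt{k}$):

```python
import math

def advanced_composition(epsilon, delta, k, delta_prime):
    """(eps', delta') for the k-fold adaptive composition of
    (epsilon, delta)-DP algorithms [Dwork, Rothblum, Vadhan 10]."""
    eps_total = epsilon * math.sqrt(k * math.log(1 / delta_prime))
    return eps_total, delta_prime + k * delta

# e.g. k = 10_000 queries, each answered with (0.005, 0)-DP:
print(advanced_composition(0.005, 0.0, 10_000, delta_prime=1e-6))
```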

18 Differential privacy [DMNS 06]
DP implies generalization, and DP composes adaptively. Putting the two together: if $M$ is “accurate” when fresh samples are used to answer each query, and $M$ is differentially private, then $M$ is “accurate” even when the same dataset is reused for adaptively-chosen queries.

19 Value perturbation [DMNS 06]
Answering a low-sensitivity query $A$ with $A(S) + \zeta$ on $n$ samples achieves error $\approx \Delta(A) \cdot \sqrt{n} \cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity $\max_{S, S'} |A(S) - A(S')|$ (at most $1/n$ for low-sensitivity queries). The drawback: $\Delta(A) \cdot \sqrt{n}$ can be much larger than the standard deviation of $A$ on $P$.

20 Beyond low-sensitivity
[F, Steinke 17] There exists an algorithm such that, for any adaptively-chosen sequence $A_1, \ldots, A_k : X^t \to \mathbb{R}$, given $n = \tilde O(\sqrt{k}) \cdot t$ i.i.d. samples from $P$, it outputs values $v_1, \ldots, v_k$ such that w.h.p. for all $i$:
$$\left|\mathbf{E}_{S \sim P^t}[A_i(S)] - v_i\right| \le 2\sigma_i, \quad \text{where } \sigma_i = \sqrt{\mathbf{Var}_{S \sim P^t}[A_i(S)]}$$

21 Stable Median
Split $S$ into $m$ disjoint chunks $S_1, S_2, \ldots, S_m$ of size $t$ each ($n = tm$) and run $A_i$ on every chunk, obtaining values $y_1, \ldots, y_m$; call this multiset $U$. Then find an approximate median of $U$ with (weak) DP relative to $U$: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of $U$.

22 Median algorithms
Both options require discretization to a ground set $T$, $|T| = r$, from which $U$ is drawn.
- Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with prob. $\propto e^{-\epsilon \left|\#\{y \in U \,:\, v \le y\} - \frac{m}{2}\right|}$; uses $O\!\left(\frac{\log r}{\epsilon}\right)$ samples.
- Better dependence on $r$: upper bound $2^{O(\log^* r)}$ samples; lower bound $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15].
Stability and confidence amplification for the price of one log factor!
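A sketch of the exponential-mechanism option (assuming numpy; `dp_median` and its parameters are illustrative names), scoring each candidate exactly as in the expression above:

```python
import numpy as np

rng = np.random.default_rng()

def dp_median(U, T, epsilon):
    """Exponential mechanism for an approximate median: each candidate
    v in the ground set T is scored by |#{y in U : v <= y} - m/2| and
    sampled with probability proportional to exp(-epsilon * score)."""
    U = np.asarray(U)
    m = len(U)
    scores = np.array([abs(np.sum(U >= v) - m / 2) for v in T])
    logits = -epsilon * scores
    logits -= logits.max()          # for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return T[rng.choice(len(T), p=p)]

# e.g. with U from chunk_values above, discretized to a grid T:
# v = dp_median(U, T=np.linspace(0.0, 1.0, 1001), epsilon=0.5)
```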

23 Limits
Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15] (*in sufficiently high dimension or under crypto assumptions). Open questions:
- Analysts without side information about $P$? Queries that depend only on previous answers.
- A fixed “natural” analyst/learning algorithm, e.g. gradient descent for stochastic convex optimization.
- Does there exist an analyst whose statistical queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?

24 ML practice
Data is split into training, validation, and testing sets. A model $f$ with parameters $\theta$ is fit on the training set (XGBoost, SVRG, TensorFlow, …), hyper-parameters are tuned on the validation set, and the test set is used to report the test error of $f \approx \mathbf{E}_{x \sim P}[\mathrm{Loss}(f, x)]$.
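A minimal sketch of that workflow's split step, assuming numpy; the function name and split fractions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def three_way_split(n, frac_train=0.6, frac_val=0.2):
    """Standard workflow from the slide: fit f on the training split,
    tune theta on validation, and touch the test split only once to
    report the final test error."""
    idx = rng.permutation(n)
    n_tr, n_va = int(frac_train * n), int(frac_val * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

train_idx, val_idx, test_idx = three_way_split(10_000)
```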

25 Reusable holdout [DFHPRR 15]
Data is split into a training set $T$ and a holdout $H$. The analyst (an “AI guru”) trains on $T$ and submits functions $f_1, f_2, \ldots, f_k$; the reusable holdout algorithm, holding $H$, returns the estimates $\mathrm{Loss}(f_1), \mathrm{Loss}(f_2), \ldots, \mathrm{Loss}(f_k)$.

26 Reusable holdout [DFHPRR 15, FS 17]
There exists an algorithm that can accurately estimate the loss of $k$ adaptively chosen functions, as long as at most $\ell$ of them overfit to the training set, for $n \sim \sqrt{\ell} \cdot \log k$. Overfitting here means $\mathbf{E}_{x \sim T}[\mathrm{Loss}(f, x)] \not\approx \mathbf{E}_{x \sim P}[\mathrm{Loss}(f, x)]$. The reason this is possible: verifying mostly-correct answers with DP is cheap, via the sparse vector technique [Dwork, Naor, Reingold, Rothblum, Vadhan 09].
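A simplified sketch in the spirit of the Thresholdout procedure from [DFHPRR 15] (assuming numpy; the threshold and noise scale are illustrative, and the budget accounting for the $\ell$ overfitting queries is omitted):

```python
import numpy as np

rng = np.random.default_rng()

def reusable_holdout(train, holdout, queries, threshold=0.04, sigma=0.01):
    """Answer from the training set while train and holdout estimates
    agree; only queries that overfit pay for a noisy holdout answer
    (sparse-vector style)."""
    answers = []
    for phi in queries:
        e_train = np.mean([phi(x) for x in train])
        e_hold = np.mean([phi(x) for x in holdout])
        # Noisy comparison: only overfitting queries touch the holdout.
        if abs(e_train - e_hold) > threshold + rng.laplace(scale=sigma):
            answers.append(e_hold + rng.laplace(scale=sigma))
        else:
            answers.append(e_train)
    return answers
```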

27 Conclusions
- Datasets are reused adaptively; this calls for a new conceptual framework.
- Deep connections to DP: privacy and generalization are aligned.
- Data “freshness” is a limited resource.
- Real-valued analyses can be handled without any assumptions.
Open directions: going beyond adversarial adaptivity; connections to stability and selective inference; using these techniques in practice.

